A Deep Interpretation of Classifier Chains

Size: px

Start display at page:

Download "A Deep Interpretation of Classifier Chains"

Corey Johns
5 years ago
Views:

1 A Deep Interpretation of Classifier Chains Jesse Read and Jaakko Holmén Aalto University School of Science, Department of Information and Computer Science and Helsinki Institute for Information Technology Finland Leuven, Belgium. Fri. 31st Oct., 2014 J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

2 Multi-label Classification x =, y = [y 1,..., y L ] = [y beach, y sunset, y foliage, y field, y mountain, y urban ] = [1, 0, 1, 0, 0, 0] J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

3 Multi-label Classification Map D input variables (feature attributes) to L output variables (labels). X 1 X 2 X 3... X D Y 1 Y 2 Y 3 Y 4 x 1 x 2 x 3... x D x 1 x 2 x 3... x D x 1 x 2 x 3... x D x 1 x 2 x 3... x D x 1 x 2 x 3... x D???? Build model h, such that ŷ = [ŷ 1,..., ŷ L ] = h( x). J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

4 Binary Relevance (BR) Train L independent models h = (h 1,..., h L ), one for each label, x y 1 y 2 y 4 For x, predict ŷ = [ŷ 1,..., ŷ L ] = [h 1 ( x),..., h L ( x)] = h( x) J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

5 Binary Relevance (BR) Train L independent models h = (h 1,..., h L ), one for each label, For x, predict Table : Binary Relevance: Model ŷ 3 = h 3 ( x) X 1 X 2 X 3... X D Y 1 Y 2 Y 3 Y 4 x 1 x 2 x 3... x D x 1 x 2 x 3... x D x 1 x 2 x 3... x D x 1 x 2 x 3... x D x 1 x 2 x 3... x D? ŷ = [ŷ 1,..., ŷ L ] = [h 1 ( x),..., h L ( x)] = h( x) J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

6 Binary Relevance (BR) Train L independent models h = (h 1,..., h L ), one for each label, For x, predict Table : Binary Relevance: Model ŷ 3 = h 3 ( x) X 1 X 2 X 3... X D Y 1 Y 2 Y 3 Y 4 x 1 x 2 x 3... x D x 1 x 2 x 3... x D x 1 x 2 x 3... x D x 1 x 2 x 3... x D x 1 x 2 x 3... x D? ŷ = [ŷ 1,..., ŷ L ] = [h 1 ( x),..., h L ( x)] = h( x) Consensus in the literature: should model relationship between labels! J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

7 Classifier Chains (CC) Predictions are cascaded along a chain as additional features x y 1 y 2 y 4 For x, predict ŷ = [ŷ 1,..., ŷ L ] = [h 1 ( x), h 2 ( x, ŷ 1 ),..., h L ( x, ŷ 1,..., ŷ L 1 )] = h( x) J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

8 Classifier Chains (CC) Predictions are cascaded along a chain as additional features For x, predict Table : Classifier Chains: Model ŷ 3 = h 3 ( x, ŷ 1, ŷ 2 ) X 1 X 2 X 3... X D Y 1 Y 2 Y 3 Y 4 x 1 x 2 x 3... x D x 1 x 2 x 3... x D x 1 x 2 x 3... x D x 1 x 2 x 3... x D x 1 x 2 x 3... x D ŷ 1 ŷ 2? ŷ = [ŷ 1,..., ŷ L ] = [h 1 ( x), h 2 ( x, ŷ 1 ),..., h L ( x, ŷ 1,..., ŷ L 1 )] = h( x) J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

9 Classifier Chains (CC) Predictions are cascaded along a chain as additional features For x, predict Table : Classifier Chains: Model ŷ 3 = h 3 ( x, ŷ 1, ŷ 2 ) X 1 X 2 X 3... X D Y 1 Y 2 Y 3 Y 4 x 1 x 2 x 3... x D x 1 x 2 x 3... x D x 1 x 2 x 3... x D x 1 x 2 x 3... x D x 1 x 2 x 3... x D ŷ 1 ŷ 2? ŷ = [ŷ 1,..., ŷ L ] = [h 1 ( x), h 2 ( x, ŷ 1 ),..., h L ( x, ŷ 1,..., ŷ L 1 )] = h( x) Typically better performance than BR, similar running time J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

10 Development of Classifier Chains If we change the order of labels, predictive performance is different. Use the default chain Use several random chains in ensemble Use chains based on performance (good accuracy, but expensive) Order the chain according to some heuristic, J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

11 Development of Classifier Chains If we change the order of labels, predictive performance is different. Use the default chain Use several random chains in ensemble Use chains based on performance (good accuracy, but expensive) Order the chain according to some heuristic, e.g., such as label dependence! J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

12 Improvements to Classifier Chains Measure dependence via empirical co-occurrence frequencies, pruned frequency counts, correlation coefficient, mutual information, tested with statistical significance tests, maximum spanning tree algorithm, probabilistic graphical model software, dependence among errors (to get conditional label dependence), test different chain orders with hold-out set / internal cross validation and search the chain space with Monte Carlo search, simulated annealing, beam search, A* search and then create, fully-cascaded classifier chains, population of chains, partially-connected chains, singly-linked chains, trees, directed graphs, undirected graphs, undirected chains, ensemble of trees, ensemble of graphs J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

13 Improvements to Classifier Chains Measure dependence via empirical co-occurrence frequencies, pruned frequency counts, correlation coefficient, mutual information, tested with statistical significance tests, maximum spanning tree algorithm, probabilistic graphical model software, dependence among errors (to get conditional label dependence), test different chain orders with hold-out set / internal cross validation and search the chain space with Monte Carlo search, simulated annealing, beam search, A* search and then create, fully-cascaded classifier chains, population of chains, partially-connected chains, singly-linked chains, trees, directed graphs, undirected graphs, undirected chains, ensemble of trees, ensemble of graphs Performance still ensemble of random chains (or too expensive) J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

14 Classifier Chains: Label Dependence We may find that... Y 1 Y 2 E(Y 2 Y 1 ) = E(Y 2 ) But if, given the input, the labels are independent, X Y 1 Y 2 E(Y 2 Y 1, X ) = E(Y 2 X ) then independent classifiers (BR) classifier chains (CC)? J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

15 Example: The xor Problem Toy problem, or and xor X 1 X 2 Y 1 Y 2 Y Clearly, E(Y 3 Y 1, Y 2, X 1, X 2 ) = E(Y 3 X 1, X 2 ), but... Table : Results: 20 examples, h j := logistic regression, default label order. Measure BR CC Hamming acc Exact match J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

16 Label Dependence Chain Structure? The optimal structure / order is likelihood dependent the one which performs best! Just looking at the labels is not enough We could model dependence between errors, rather than labels, [ɛ 1, ɛ 2,..., ɛ L ] = [(y 1 h 1 (x)) 2,..., (y L h L (x)) 2 ] (to take into account the input) but we have to train h pairwise only (or expensive), and does not inform us of directionality! We could use an undirected graph, but then we lose greedy inference. J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

17 Classifier Chains as a Neural Network From the point of view of Y 3 (xor), x x y 1 x y 2 y 1 y 2 y 1 y 2 = = A hidden layer / feature space projection! does not work if we swap Y 3 (xor) with Y 1 (or) x y 1 J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

18 Classifier Chains as a Neural Network From the point of view of Y 3 (xor), x x y 1 x x y 2 y 1 y 2 y 1 y 2 y 1 y 2 = = A hidden layer / feature space projection! does not work if we swap Y 3 (xor) with Y 1 (or) x y 1 The third graph is enough for all labels! J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

19 Labels are Transformations of the Input Labels are transformations of the input, which we learn from the training data. For example, ŷ 2 = f 2 (x) = σ(w φ) = σ(w [x 1,..., x D, h(x)] x x f 1 ( ) f 2 ( ) f 1 ( ) f 2 ( ) f 3 ( ) For some fk (x),..., f K (x), any label could be learned separately. J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

20 Labels are Transformations of the Input Labels are transformations of the input, which we learn from the training data. For example, ŷ 2 = f 2 (x) = σ(w φ) = σ(w [x 1,..., x D, h(x)] x f 1 ( ) f 2 ( ) f 3 ( ) y 1 y 2 For some fk (x),..., f K (x), any label could be learned separately. J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

21 A Deep Neural Network What if we replace labels with hidden units? If we 1 (expand x, and invert the order of layers) and 2 make linear units, then we are looking at a deep neural network, y 1 y 2 y 1 y 2 z 1 z 2 z 3 z 4 f 1 ( ) f 2 ( ) f 3 ( ) f 4 ( ) z 1 z 2 z 3 z 4 x 1 x 2 x 3 x 4 x 5 x 1 x 2 x 3 x 4 x 5 J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

22 Learning Hidden Units y 1 y 2 z 1 z 2 z 3 z 4 z 1 z 2 z 3 z 4 x 1 x 2 x 3 x 4 x 5 We can learn the hidden layers z [1] and z [2] with Restricted Boltzmann Machines (for example), and then plug in back propagation; or plug in a binary relevance learner ŷ = h(z [2] ) = h(f [2] (f [1] ( x))) J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

23 Results Dataset BR CC D CC D BR DBP music (6) (5) (4) (2) (1) scene (6) (4) (3) (1) (7) yeast (5) (2) (1) (6) (3) genbase (3) (1) (1) (5) (5) medical (4) (2) (5) (6) (1) enron (5) (4) (1) (2) (3) avg. rank D h = Deep Structure + h-model on final layer (SVM as base classifier). BR performs poorly (as usual)... but not if we learn higher-level features first! (D BR, DBP) This structure + CC (D CC) performs best of all J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

24 Results Dataset BR CC D CC D BR DBP music (6) (5) (4) (2) (1) scene (6) (4) (3) (1) (7) yeast (5) (2) (1) (6) (3) genbase (3) (1) (1) (5) (5) medical (4) (2) (5) (6) (1) enron (5) (4) (1) (2) (3) avg. rank D h = Deep Structure + h-model on final layer (SVM as base classifier). BR performs poorly (as usual)... but not if we learn higher-level features first! (D BR, DBP) This structure + CC (D CC) performs best of all With the right features, we can use BR even with a linear learner J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

25 Why Mechanism behind classifier chains not well understood Investment in improvements to classifier chains not being rewarded There are disadvantages to classifier chains: complexity with many labels what structure/directionality to use? inflexible (difficult to add/remove labels) J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

26 Reflections / Conclusions Classifier chains work as a deep structure other labels are supervised features We can use binary relevance if final-layer inputs are independent of each other (we did this using multiple layers of feature transforms) This has advantages, such as semi-supervised learning, easier to add and drop labels J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

27 A Deep Interpretation of Classifier Chains Jesse Read and Jaakko Holmén Aalto University School of Science, Department of Information and Computer Science and Helsinki Institute for Information Technology Finland Leuven, Belgium. Fri. 31st Oct., 2014 J. Read and J. Holmén (Aalto University) A Deep Interpretation Classifier Chains Oct 31, / 17

Multi-label learning from batch and streaming data

Multi-label learning from batch and streaming data Jesse Read Télécom ParisTech École Polytechnique Summer School on Mining Big and Complex Data 5 September 26 Ohrid, Macedonia Introduction x = Binary