Active learning in sequence labeling


1 Active learning in sequence labeling
Tomáš Šabata
Czech Technical University in Prague, Faculty of Information Technology, Department of Theoretical Computer Science

2 Table of contents
1. Introduction
2. Active learning
3. Graphical models
4. Active learning in sequence labeling
5. Semi-supervised active learning in sequence labeling
6. Experiment
7. Summary

3 Introduction

4 Sequence modeling and labeling: problem definition
Sequence of states, sequence of observations.
Figure 1: Sequence representation

5 Sequence modeling and labeling: problem definition
1. Sequence modeling: given a sequence of states/labels and a sequence of observations, find the model that most likely generated the sequences.
2. Sequence labeling: given a sequence of observations, determine an appropriate label/state for each observation. Considering the relations between observations reduces errors.

6 Applications
- Handwriting recognition
- Facial expression dynamics modeling
- DNA analysis
- Part-of-speech tagging
- Speech recognition
- Video analysis

7 Active learning

8 What is active learning?
The quality of labels makes a huge difference: garbage in, garbage out.
Obtaining gold-standard annotated data can be really expensive.

9 What is active learning?
Figure 2: The active learning cycle

10 Scenarios
Figure 3: Active learning scenarios

11 AL algorithm
Data: $L$ - set of labeled examples, $U$ - set of unlabeled examples, $\theta$ - utility function
while stopping criterion is not met do
  1. learn model $M$ from $L$;
  2. for all $x_i \in U$: $u_{x_i} \leftarrow \theta_M(x_i)$;
  3. select the example $x^* \in U$ with the highest utility $u_{x_i}$;
  4. query the annotator for the label $y^*$ of example $x^*$;
  5. move $\langle x^*, y^* \rangle$ to $L$;
end
return $L$
Algorithm 1: General pool-based AL framework
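
To make the loop concrete, here is a minimal Python sketch of Algorithm 1; `Model`, `utility`, and `query_annotator` are hypothetical placeholders for the learner, the utility function $\theta$, and the human oracle, not names from the slides.

    def pool_based_al(L, U, Model, utility, query_annotator, budget):
        """L: list of (x, y) pairs; U: list of unlabeled x; budget: query limit."""
        for _ in range(budget):                      # stopping criterion
            xs, ys = zip(*L)
            model = Model().fit(xs, ys)              # 1. learn M from L
            scores = [utility(model, x) for x in U]  # 2. score every x_i in U
            best = max(range(len(U)), key=scores.__getitem__)
            x_star = U.pop(best)                     # 3. pick the argmax
            y_star = query_annotator(x_star)         # 4. ask the annotator
            L.append((x_star, y_star))               # 5. move <x*, y*> to L
        return L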

12 Query strategy frameworks
1. Uncertainty Sampling
2. Query-By-Committee
3. Expected Model Change
4. Expected Error Reduction
5. Variance Reduction
6. Density-Weighted Methods

13 Frameworks: Uncertainty Sampling
The simplest and most commonly used framework; intuitive for probabilistic learning models.
Binary problems: choose the instance with posterior probability closest to 0.5.
Multiclass problems:
- Least confident: $x^*_{LC} = \arg\max_x 1 - P_\theta(\hat{y} \mid x)$
- Margin sampling: $x^*_{M} = \arg\min_x P_\theta(\hat{y}_1 \mid x) - P_\theta(\hat{y}_2 \mid x)$
- Entropy: $x^*_{H} = \arg\max_x -\sum_i P_\theta(y_i \mid x) \log P_\theta(y_i \mid x)$
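
A small sketch of the three multiclass measures, assuming `p` is a NumPy vector of posterior class probabilities for one instance (e.g. from a `predict_proba` call); note only the selection direction differs (argmax for LC and entropy, argmin for margin).

    import numpy as np

    def least_confident(p):
        return 1.0 - np.max(p)            # 1 - P(y_hat | x); query the argmax

    def margin(p):
        top2 = np.sort(p)[-2:]
        return top2[1] - top2[0]          # P(y1|x) - P(y2|x); query the argmin

    def entropy(p):
        p = p[p > 0]                      # drop zeros to avoid log(0)
        return -np.sum(p * np.log(p))     # query the argmax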

14 Frameworks: Uncertainty Sampling
Figure 4: Uncertainty sampling for a three-class classification problem
The right measure is application dependent:
- Entropy: minimizes log-loss.
- LC and margin: minimize classification error.

15 Frameworks: Query-By-Committee
We maintain a committee $C = \{\theta^{(1)}, \dots, \theta^{(C)}\}$ of models trained on $L$.
The most informative query is considered to be the instance about which the committee members most disagree.
We need to ensure variability of the models in the beginning.
Measures of disagreement:
- Vote entropy: $x^*_{VE} = \arg\max_x -\sum_i \frac{V(y_i)}{C} \log \frac{V(y_i)}{C}$
- Kullback-Leibler divergence: $x^*_{KL} = \arg\max_x \frac{1}{C} \sum_{c=1}^{C} D_{KL}(P_{\theta^{(c)}} \,\|\, P_C)$
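
A sketch of the vote-entropy disagreement for a single instance, assuming `votes` holds one hard label prediction per committee member:

    import numpy as np
    from collections import Counter

    def vote_entropy(votes):
        """votes: list of predicted labels, one per committee member."""
        C = len(votes)
        ve = 0.0
        for count in Counter(votes).values():
            v = count / C                 # V(y_i) / C
            ve -= v * np.log(v)
        return ve                         # query the argmax over U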

16 Frameworks: Expected Model Change
Applicable to models with gradient-based training.
Query the instance that would cause the largest model change, using the gradient of the objective function $\nabla \ell_\theta(L)$:
$x^*_{EMC} = \arg\max_x \sum_i P_\theta(y_i \mid x) \, \| \nabla \ell_\theta(L \cup \langle x, y_i \rangle) \|$
Note: $\nabla \ell_\theta(L)$ should be close to zero at convergence, therefore we can use the approximation $\nabla \ell_\theta(L \cup \langle x, y_i \rangle) \approx \nabla \ell_\theta(\langle x, y_i \rangle)$.
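
For intuition, a sketch of the expected gradient length for binary logistic regression, where the log-likelihood gradient on a single example $\langle x, y \rangle$ has the closed form $(y - \sigma(w^T x))\,x$; `w` and `x` are assumed NumPy arrays, and the approximation above is used.

    import numpy as np

    def expected_gradient_length(w, x):
        p1 = 1.0 / (1.0 + np.exp(-w @ x))        # P(y = 1 | x)
        egl = 0.0
        for y, p in ((0, 1.0 - p1), (1, p1)):
            grad = (y - p1) * x                  # gradient on <x, y> alone
            egl += p * np.linalg.norm(grad)      # P(y|x) * ||gradient||
        return egl                               # query the argmax over U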

17 Frameworks: Expected Error Reduction
Estimate the expected future error of a model trained on $L \cup \langle x, y \rangle$.
Methods:
- Minimizing the expected 0/1-loss: $x^* = \arg\min_x \sum_i P_\theta(y_i \mid x) \left( \sum_{u=1}^{U} 1 - P_{\theta^{+\langle x, y_i \rangle}}(\hat{y} \mid x^{(u)}) \right)$
- Minimizing the expected log-loss: $x^* = \arg\min_x \sum_i P_\theta(y_i \mid x) \left( -\sum_{u=1}^{U} \sum_j P_{\theta^{+\langle x, y_i \rangle}}(y_j \mid x^{(u)}) \log P_{\theta^{+\langle x, y_i \rangle}}(y_j \mid x^{(u)}) \right)$
In most cases this is the most computationally expensive query framework:
- Logistic regression: $O(ULG)$
- CRF: $O(TM^{T+2}ULG)$

18 Frameworks: Variance Reduction
Uses the bias-variance decomposition (noise + squared bias + variance):
$E_L[(\hat{y} - y)^2 \mid x] = E[(y - E[y \mid x])^2] + (E_L[\hat{y}] - E[y \mid x])^2 + E_L[(\hat{y} - E_L[\hat{y}])^2]$
Queries aim at reducing the variance term. A model-dependent framework.

19 Frameworks: Density-Weighted Methods
Informative instances should not only be those which are uncertain, but also those which are representative of the underlying distribution.
Uses one of the other query strategies as a base strategy $\phi_A$ (e.g. uncertainty sampling):
$x^*_{ID} = \arg\max_x \phi_A(x) \times \left( \frac{1}{U} \sum_{u=1}^{U} sim(x, x^{(u)}) \right)^{\beta}$
The method is more robust to outliers in the dataset.
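
A sketch of information density with cosine similarity, where `base_utility` stands for $\phi_A$ (e.g. one of the uncertainty measures above) and `U` is assumed to be a matrix of feature vectors for the unlabeled pool:

    import numpy as np

    def cosine_sim(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def information_density(x, base_utility, U, beta=1.0):
        density = np.mean([cosine_sim(x, u) for u in U])
        return base_utility(x) * density ** beta   # query the argmax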

20 Active learning problem variants
- Active Learning for Structured Outputs: an instance is not represented by a single feature vector but by a structure, e.g. sequences, trees, grammars.
- Active Feature Acquisition: selection of salient unused features, e.g. medical tests, sensitive information.
- Active Class Selection: the learner is allowed to query for a known class label, and obtaining each instance incurs a cost.
- Active Clustering: generate (or subsample) instances in such a way that they self-organize into groupings, trying to get less overlap and noise than with random sampling.


22 Graphical models

23 Models: Markov model
Figure 5: Markov model/chain

24 Models: Hidden Markov model
Figure 6: Hidden Markov model

25 Models: Hidden Markov model
$\lambda = (A, B, \pi)$
- Set of hidden states $Y = \{y_1, y_2, \dots, y_N\}$, set of observable values $X = \{x_1, x_2, \dots, x_M\}$
- Sequence of states $Q = q_1 q_2 q_3 \dots q_T$, sequence of outputs $O = o_1 o_2 o_3 \dots o_T$
- Transition probability matrix $A = \{a_{ij}\}$, $a_{ij} = P(q_t = y_j \mid q_{t-1} = y_i)$, $1 \le i, j \le N$
- Emission probability distribution $B = \{b_{ij}\}$, $b_{ij} = P(o_t = x_j \mid q_t = y_i)$, $1 \le i \le N$, $1 \le j \le M$
- Initial probability distribution $\pi = \{\pi_i\}$, $\pi_i = P(q_1 = y_i)$, $1 \le i \le N$
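
Given these parameters, $P(O \mid \lambda)$ can be computed with the forward algorithm; a minimal NumPy sketch, assuming observations are given as integer indices:

    import numpy as np

    def forward(A, B, pi, O):
        """A: (N, N) transitions, B: (N, M) emissions, pi: (N,) initial,
        O: list of observation indices o_1 .. o_T."""
        alpha = pi * B[:, O[0]]              # alpha_1(i) = pi_i * b_i(o_1)
        for o in O[1:]:
            alpha = (alpha @ A) * B[:, o]    # alpha_t = (alpha_{t-1} A) * b(o_t)
        return alpha.sum()                   # P(O | lambda) = sum_i alpha_T(i)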

26 Models: Conditional random field (linear chain)
Figure 7: Linear-chain conditional random field
A discriminative model of $P(Y \mid X)$; we do not explicitly model $P(X)$.
Performs better than HMMs when the true data distribution has higher-order dependencies than the model.
$P(Y \mid X) = \frac{1}{Z(X)} \prod_{t=1}^{T} \exp\left( \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, x_t) \right)$
$Z(X) = \sum_{Y} \prod_{t=1}^{T} \exp\left( \sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, x_t) \right)$
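
The sum over all $M^T$ label sequences in $Z(X)$ factorizes along the chain, so it can be computed by a forward recursion; a sketch in log space, assuming the log-potentials $\sum_k \theta_k f_k(y_t, y_{t-1}, x_t)$ have been precomputed into an array:

    import numpy as np
    from scipy.special import logsumexp

    def log_partition(log_psi0, log_psi):
        """log_psi0: (M,) log-potentials at t = 1;
        log_psi: (T-1, M, M) pairwise log-potentials for t = 2 .. T."""
        log_alpha = log_psi0
        for t in range(log_psi.shape[0]):
            # log_alpha[j] = logsumexp_i(log_alpha[i] + log_psi[t, i, j])
            log_alpha = logsumexp(log_alpha[:, None] + log_psi[t], axis=0)
        return logsumexp(log_alpha)          # log Z(X)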

27 Models: Summary
1. HMM: $P(Q, O) = \prod_{t=1}^{T} P(q_t \mid q_{t-1}) \, P(o_t \mid q_t)$ (with $P(q_1 \mid q_0) := \pi_{q_1}$)
2. CRF: $P(Q \mid O) = \frac{1}{Z_O} \prod_{t=1}^{T} \exp\left( \sum_j \lambda_j f_j(q_t, q_{t-1}) + \sum_k \mu_k g_k(q_t, o_t) \right)$

28 Active learning in sequence labeling

29 Uncertainty sampling
- Least confident: $x^*_{LC} = \arg\max_x 1 - P_\theta(\hat{y} \mid x)$, where $\hat{y}$ is the Viterbi path.
- Margin sampling: $x^*_{M} = \arg\min_x P_\theta(\hat{y}_1 \mid x) - P_\theta(\hat{y}_2 \mid x)$, using the N-best algorithm.
- Entropy:
  - Token entropy: $x^*_{TE} = \arg\max_x -\frac{1}{T} \sum_{t=1}^{T} \sum_{m=1}^{M} P_\theta(y_t = m) \log P_\theta(y_t = m)$
  - Total token entropy: the non-normalized variant (without the $\frac{1}{T}$ factor).
  - Sequence entropy: $x^*_{SE} = \arg\max_x -\sum_{\hat{y}} P_\theta(\hat{y} \mid x) \log P_\theta(\hat{y} \mid x)$
  - N-best sequence entropy: $x^*_{NSE} = \arg\max_x -\sum_{\hat{y} \in N} P_\theta(\hat{y} \mid x) \log P_\theta(\hat{y} \mid x)$
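
A sketch of the normalized token entropy, assuming `marginals` is a (T, M) array of per-token marginal distributions $P_\theta(y_t = m)$, e.g. from the forward-backward algorithm; dropping the division by T gives the total token entropy.

    import numpy as np

    def token_entropy(marginals):
        p = np.clip(marginals, 1e-12, 1.0)   # clip to avoid log(0)
        return -np.sum(p * np.log(p)) / marginals.shape[0]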

30 Query by Committee
Query-by-bagging (each model has a unique modified set $L^{(c)}$).
- Vote entropy - disagreement over the Viterbi paths: $x^*_{VE} = \arg\max_x -\frac{1}{T} \sum_{t=1}^{T} \sum_{m=1}^{M} \frac{V(y_t, m)}{C} \log \frac{V(y_t, m)}{C}$
- Kullback-Leibler: $x^*_{KL} = \arg\max_x \frac{1}{T} \sum_{t=1}^{T} \frac{1}{C} \sum_{c=1}^{C} D_{KL}(\theta^{(c)} \,\|\, C)$, where $D_{KL}(\theta^{(c)} \,\|\, C) = \sum_{m=1}^{M} P_{\theta^{(c)}}(y_t = m) \log \frac{P_{\theta^{(c)}}(y_t = m)}{P_C(y_t = m)}$
- Non-normalized variants (without the $\frac{1}{T}$ factor).
- Sequence vote entropy: $x^*_{SVE} = \arg\max_x -\sum_{\hat{y} \in N^C} P(\hat{y} \mid x, C) \log P(\hat{y} \mid x, C)$
- Sequence Kullback-Leibler: $x^*_{SKL} = \arg\max_x \frac{1}{C} \sum_{c=1}^{C} \sum_{\hat{y} \in N^C} P_{\theta^{(c)}}(\hat{y} \mid x) \log \frac{P_{\theta^{(c)}}(\hat{y} \mid x)}{P_C(\hat{y} \mid x)}$
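
A sketch of the token-level vote entropy, assuming `paths` holds each committee member's Viterbi path for the same input sequence:

    import numpy as np
    from collections import Counter

    def token_vote_entropy(paths):
        """paths: list of C label sequences, all of length T."""
        C, T = len(paths), len(paths[0])
        ve = 0.0
        for t in range(T):
            for count in Counter(path[t] for path in paths).values():
                v = count / C                # V(y_t, m) / C
                ve -= v * np.log(v)
        return ve / T                        # normalized by sequence length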

31 Expected Gradient Length
Expectation over the N-best labelings:
$x^*_{EGL} = \arg\max_x \sum_{\hat{y} \in N} P_\theta(\hat{y} \mid x) \, \| \nabla \ell_\theta(L \cup \langle x, \hat{y} \rangle) \|$

32 Expected Error Reduction
$O(TM^{T+2}ULG)$ - too expensive :(

33 Density-Weighted Methods
Information density solves the problem that US and QBC are prone to querying outliers.
We need a distance measure for sequences:
- Kullback-Leibler divergence
- Euclidean distance
- Cosine distance
Drawback: the number of required similarity calculations grows quadratically with the number of instances in $U$. Solution: precompute them, as in the sketch below.
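
A sketch of the precomputation, assuming each sequence in $U$ is summarized by one feature vector so that all cosine similarities come out of a single matrix product:

    import numpy as np

    def precompute_density(features, beta=1.0):
        """features: (U, d) matrix, one row per unlabeled sequence."""
        norms = np.linalg.norm(features, axis=1, keepdims=True)
        sim = (features @ features.T) / (norms * norms.T)  # cosine similarity
        return sim.mean(axis=1) ** beta      # density term for each x in U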

34 Semi-supervised active learning in sequence labeling

35 Fully supervised vs. semi-supervised
FuSAL:
- A sequence is handled as a whole unit.
- Sequence-wise vs. token-wise utility functions.
SeSAL:
- Some subsequences can be easily labeled automatically.
- Decreases the labeling effort.
- Uses the self-training principle.

36 Fully supervised general algorithm
Data: $B$ - number of examples to be selected, $L$ - set of labeled examples, $U$ - set of unlabeled examples, $\theta$ - utility function
while stopping criterion is not met do
  1. learn model $M$ from $L$;
  2. for all $x_i \in U$: $u_{x_i} \leftarrow \theta_M(x_i)$;
  3. select the $B$ examples $x_i \in U$ with the highest utility $u_{x_i}$;
  4. annotate the sequences using $M$;
  5. query for the labels of the non-confident tokens;
  6. move the newly annotated examples to $L$;
end
return $L$
Algorithm 2: General AL framework
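
A sketch of steps 4-5 for one selected sequence: confident tokens keep the model's own prediction (self-training), the rest go to the annotator. `marginals`, `query_token`, and the confidence `threshold` are assumed names for illustration, not from the slides.

    def label_sequence(x, viterbi_path, marginals, query_token, threshold=0.99):
        """marginals: (T, M) array of token marginals; returns a full labeling."""
        labels = []
        for t, y_hat in enumerate(viterbi_path):
            if marginals[t].max() >= threshold:   # confident: self-label
                labels.append(y_hat)
            else:                                 # non-confident: ask a human
                labels.append(query_token(x, t))
        return labels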

37 Experiment

38 Problem definition
Handwritten letter recognition.
- Each letter is randomly written in one of 6 fonts.
- Downscaled to 4x4 pixels.
- Letters are organized into real sentences.
Figure 8: A, Figure 9: B, Figure 10: C

39 Model and training
- Linear-chain conditional random fields.
- 3 labeled sentences in the training dataset at the beginning.
- Labeled sentences are added to the dataset iteratively (50 times).
Compared strategies:
- Random choice
- FuSAL (Least Confident)
- SeSAL (Least Confident + marginal probability)
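
One plausible way to implement a FuSAL (least confident) round with the sklearn-crfsuite package; since the exact Viterbi path probability is not exposed there directly, this sketch approximates the confidence by the product of per-token maximum marginals, which is an assumption of the sketch rather than the setup reported on the slides.

    import numpy as np
    import sklearn_crfsuite

    def fusal_round(X_labeled, y_labeled, X_pool, B=1):
        """X_*: lists of token feature-dict sequences; y_labeled: label sequences."""
        crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1,
                                   max_iterations=100)
        crf.fit(X_labeled, y_labeled)
        scores = []
        for marg in crf.predict_marginals(X_pool):           # dicts per token
            conf = np.prod([max(m.values()) for m in marg])  # approx. P(y_hat|x)
            scores.append(1.0 - conf)                        # least confident
        return np.argsort(scores)[-B:]                       # indices to query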

40 Results
Is it even worth it?

41 Results
Let's look at it from another point of view.

42 Results

43 Summary

44 Summary
- It does not work in all cases.
- It can add implementation overhead.
- In most cases, active learning helps.
- It can be applied to different structures such as sequences or trees.
- The combination with semi-supervised learning can lead to substantial cost savings.
Future work: different query costs, automatic threshold finding, CT-HMM.

45 Thank you for your attention. Questions?
