Learning From Crowds. Presented by: Bei Peng 03/24/15


1 Learning From Crowds
Presented by: Bei Peng, 03/24/15

2 Supervised Learning
Given labeled training data $D = \{x_i, y_i\}_{i=1}^{N}$, learn to generalize well on unseen data.
- Binary classification ($\mathcal{Y} = \{0, 1\}$)
- Multi-class classification ($\mathcal{Y} = \{1, \dots, K\}$)
- Ordinal regression ($\mathcal{Y} = \{1, \dots, K\}$, $1 < \dots < K$)
- Regression ($\mathcal{Y} = \mathbb{R}$)

3 Motivation
Can we acquire the actual label?
- Expensive or tedious
- Multiple annotators (possibly noisy)
- Disagreement among annotators

4 Computer-aided Diagnosis (CAD)
- Interpretation of medical images is never 100% accurate (detection and characterization errors)
- CAD provides a second opinion to radiologists in the detection and diagnostic process
- The radiologist is the primary reader and decision maker

5 Computer-aided Diagnosis (CAD)
- Build a classifier to predict whether a suspicious region on a medical image is malignant
- The actual gold standard can only be obtained from a biopsy of the tissue (expensive, invasive)
- CAD systems are built from labels assigned by multiple radiologists

6 Computer-aided Diagnosis (CAD)
- Each radiologist provides a subjective version of the gold standard
  - malignant lesions: size, shape, degree of malignancy
- Substantial variation among different annotators
  - luminaries, experts, residents, novices

7 Crowdsourcing
- Inexpensive to acquire labels from a large number of annotators

8 Crowdsourcing
- One challenge: errors among the crowd
  - the performance of different annotators can vary widely
- How to evaluate the annotators without an actual gold standard?

9 Research Questions
1. How can conventional supervised learning algorithms be adapted to this scenario?
2. How can systems be evaluated when there is no absolute gold standard?
3. How reliable or trustworthy is each annotator?


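10 Baseline: Majority Voting
Use the label on which the majority of the annotators agree:

$$\hat{y}_i = \begin{cases} 1 & \text{if } (1/R) \sum_{j=1}^{R} y_i^j > 0.5 \\ 0 & \text{if } (1/R) \sum_{j=1}^{R} y_i^j < 0.5 \end{cases}$$

Alternatively, consider every (instance, label) pair as a separate example, which corresponds to the soft estimate

$$\Pr[y_i = 1 \mid y_i^1, \dots, y_i^R] = (1/R) \sum_{j=1}^{R} y_i^j$$

What is the problem with majority voting?
- It assumes all experts are equally good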
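The baseline above can be sketched in a few lines of Python. This is only an illustration: the function name `majority_vote` and the toy labels are made up here, and ties (a vote fraction of exactly 0.5, which the slide leaves undefined) are assigned to class 0 as an arbitrary assumption.

```python
def majority_vote(labels):
    """labels: list of N rows, each a list of R binary labels (0 or 1).

    Returns (soft, hard): the per-instance vote fraction and the
    thresholded majority label.
    """
    soft = [sum(row) / len(row) for row in labels]
    # Assumption: ties (fraction exactly 0.5) fall to class 0.
    hard = [1 if s > 0.5 else 0 for s in soft]
    return soft, hard


# Toy example: 3 instances, 3 annotators each (hypothetical data).
labels = [
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
]
soft, hard = majority_vote(labels)
print(soft, hard)
```

Every annotator contributes the same weight to `soft`, which is exactly the weakness the slide points out.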

11 Proposed Approach
- A maximum-likelihood estimator that jointly learns the classifier, the annotator accuracy, and the actual true label
- The proposed algorithm automatically discovers the best experts and assigns higher weight to them
- Beta prior
- Expectation Maximization

12 Related Work
- Smyth (1995) estimated observer error rates based on the results from labeling volcanoes in satellite images of Venus
- Smyth (1995) first estimated the ground truth and then used the ground truth to learn a classifier
- Some researchers collect crowd annotations and show that they can be as good as those provided by experts

13 Contributions
- estimating the ground truth
- learning a classifier
- first ground truth, then classifier
- learn jointly (superior)
- more general
- Bayesian (prior)
- EM approach

14 Binary Classification
- A two-coin model for annotators
- Classification model
- Estimation problem
- Maximum likelihood estimator
- The EM algorithm
- A Bayesian approach

15 Sensitivity and Specificity

16 Sensitivity and Specificity
- Sensitivity (true positive rate): sick people correctly identified as sick
- Specificity (true negative rate): healthy people correctly identified as healthy
- False positive rate: healthy people incorrectly identified as sick
- False negative rate: sick people incorrectly identified as healthy

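17 Sensitivity and Specificity

$$\text{sensitivity} = \frac{a}{a + c}, \qquad \text{specificity} = \frac{d}{b + d}$$

Here $a$ counts true positives, $b$ false positives, $c$ false negatives, and $d$ true negatives.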
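As a small worked example of the two ratios, the sketch below counts the four cells of the 2x2 table from paired true and predicted labels. The function name and the toy data are illustrative, and the code assumes both classes occur in `y_true` (otherwise a denominator would be zero).

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    a = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    b = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    c = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    d = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    return a / (a + c), d / (b + d)


# Hypothetical data: 3 sick and 5 healthy people.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```

In this toy case two of the three sick people are found and four of the five healthy people are cleared.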

18 A two-coin model for annotators
Each annotator derives her label from the hidden true label using two biased coins:
- if the true label is one, she flips a coin with bias $\alpha^j$
- if the true label is zero, she flips a coin with bias $\beta^j$
- if she gets heads, she keeps the original label; otherwise she flips it

Sensitivity of the $j$-th annotator: $\alpha^j := \Pr[y^j = 1 \mid y = 1]$
Specificity of the $j$-th annotator: $\beta^j := \Pr[y^j = 0 \mid y = 0]$
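The two-coin model is easy to simulate, which is also one way to sanity-check it. The sketch below is an illustration only: `simulate_annotator`, the seed, the sample size, and the values 0.9 and 0.7 are all assumptions made for the example. It generates noisy labels for one annotator and then recovers her sensitivity and specificity empirically.

```python
import random


def simulate_annotator(true_labels, alpha, beta, rng):
    """Draw one annotator's labels under the two-coin model.

    alpha: probability of keeping a true label of 1 (sensitivity).
    beta:  probability of keeping a true label of 0 (specificity).
    """
    noisy = []
    for y in true_labels:
        keep_prob = alpha if y == 1 else beta
        heads = rng.random() < keep_prob
        noisy.append(y if heads else 1 - y)  # tails flips the label
    return noisy


rng = random.Random(0)
true_labels = [rng.randint(0, 1) for _ in range(20000)]
noisy = simulate_annotator(true_labels, alpha=0.9, beta=0.7, rng=rng)

# Empirical sensitivity and specificity of the simulated annotator.
on_pos = [n for y, n in zip(true_labels, noisy) if y == 1]
on_neg = [n for y, n in zip(true_labels, noisy) if y == 0]
est_alpha = sum(on_pos) / len(on_pos)
est_beta = 1 - sum(on_neg) / len(on_neg)
print(est_alpha, est_beta)
```

With enough instances the empirical rates should land close to the biases of the two coins.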

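19 Classification Model
Consider linear discriminating functions $f_w(x) = w^\top x$ with $w, x \in \mathbb{R}^d$:

$$\hat{y} = \begin{cases} 1 & \text{if } w^\top x \ge \gamma \\ 0 & \text{if } w^\top x < \gamma \end{cases}$$

Logistic regression:

$$\Pr[y = 1 \mid x, w] = \frac{1}{1 + e^{-w^\top x}}$$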

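20 Learning Problem
Given $D = \{x_i, y_i^1, \dots, y_i^R\}_{i=1}^{N}$, estimate:
- the sensitivities $\alpha = [\alpha^1, \dots, \alpha^R]$
- the specificities $\beta = [\beta^1, \dots, \beta^R]$
- the weight vector $w$
- the true labels $y_1, \dots, y_N$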

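21 Maximum Likelihood Estimator
With parameters $\theta = \{w, \alpha, \beta\}$:

$$\Pr[D \mid \theta] = \prod_{i=1}^{N} \Pr[y_i^1, \dots, y_i^R \mid x_i, \theta]$$

$$\Pr[D \mid \theta] = \prod_{i=1}^{N} \Big\{ \Pr[y_i^1, \dots, y_i^R \mid y_i = 1, \alpha] \, \Pr[y_i = 1 \mid x_i, w] + \Pr[y_i^1, \dots, y_i^R \mid y_i = 0, \beta] \, \Pr[y_i = 0 \mid x_i, w] \Big\}$$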

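22 Maximum Likelihood Estimator

$$\Pr[y_i^1, \dots, y_i^R \mid y_i = 1, \alpha] = \prod_{j=1}^{R} \Pr[y_i^j \mid y_i = 1, \alpha^j] = \prod_{j=1}^{R} [\alpha^j]^{y_i^j} [1 - \alpha^j]^{1 - y_i^j}$$

($\alpha^j$: true positive rate; $1 - \alpha^j$: false negative rate)

$$\Pr[y_i^1, \dots, y_i^R \mid y_i = 0, \beta] = \prod_{j=1}^{R} [\beta^j]^{1 - y_i^j} [1 - \beta^j]^{y_i^j}$$

($\beta^j$: true negative rate; $1 - \beta^j$: false positive rate)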

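23 Maximum Likelihood Estimator

$$\Pr[D \mid \theta] = \prod_{i=1}^{N} [a_i p_i + b_i (1 - p_i)]$$

where, with $\sigma(z) = 1 / (1 + e^{-z})$ the logistic function,

$$p_i := \sigma(w^\top x_i), \qquad a_i := \prod_{j=1}^{R} [\alpha^j]^{y_i^j} [1 - \alpha^j]^{1 - y_i^j}, \qquad b_i := \prod_{j=1}^{R} [\beta^j]^{1 - y_i^j} [1 - \beta^j]^{y_i^j}$$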
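To make the likelihood concrete, here is a minimal sketch that evaluates its logarithm for given parameters. The helper names `sigmoid` and `log_likelihood` and the toy numbers are assumptions for illustration, and `p_i` is taken to be the logistic-regression probability from the classification-model slide.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(X, Y, w, alpha, beta):
    """ln Pr[D | theta] for the two-coin model.

    X: N feature vectors; Y: N rows of R annotator labels (0 or 1);
    w: weight vector; alpha, beta: per-annotator sensitivity and specificity.
    """
    total = 0.0
    for x_i, y_i in zip(X, Y):
        p_i = sigmoid(sum(wk * xk for wk, xk in zip(w, x_i)))
        a_i, b_i = 1.0, 1.0
        for j, y in enumerate(y_i):
            a_i *= alpha[j] if y == 1 else 1 - alpha[j]  # Pr[y^j | y_i = 1]
            b_i *= beta[j] if y == 0 else 1 - beta[j]    # Pr[y^j | y_i = 0]
        total += math.log(a_i * p_i + b_i * (1 - p_i))
    return total


# One instance, two annotators who disagree (hypothetical values).
ll = log_likelihood(X=[[1.0]], Y=[[1, 0]], w=[0.0],
                    alpha=[0.9, 0.8], beta=[0.7, 0.6])
print(ll)
```

In the toy call, `w = 0` gives `p_i = 0.5`, and both products evaluate to 0.18, so the likelihood of the single instance is 0.18.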

24 Maximum Likelihood Estimator
Maximizing the log-likelihood:

$$\hat{\theta}_{ML} = \{\hat{\alpha}, \hat{\beta}, \hat{w}\} = \arg\max_{\theta} \{\ln \Pr[D \mid \theta]\}$$

25 Expectation Maximization
Iterative procedure:
- establish a particular gold standard
- measure the performance of the experts
- refine the gold standard based on the performance measure

26 Expectation Maximization

$$\ln \Pr[D, y \mid \theta] = \sum_{i=1}^{N} y_i \ln p_i a_i + (1 - y_i) \ln (1 - p_i) b_i$$

Expectation (E) step and Maximization (M) step

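27 Expectation Maximization (E-step)

$$E\{\ln \Pr[D, y \mid \theta]\} = \sum_{i=1}^{N} \mu_i \ln p_i a_i + (1 - \mu_i) \ln (1 - p_i) b_i$$

where $\mu_i = \Pr[y_i = 1 \mid y_i^1, \dots, y_i^R, x_i, \theta]$. Since

$$\mu_i \propto \Pr[y_i^1, \dots, y_i^R \mid y_i = 1, \theta] \, \Pr[y_i = 1 \mid x_i, \theta],$$

normalizing gives

$$\mu_i = \frac{a_i p_i}{a_i p_i + b_i (1 - p_i)}$$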

28 Expectation Maximization (M-step)
Based on the current estimate $\mu_i$ and the data $D$, the model parameters are estimated by maximizing the conditional expectation:

$$\alpha^j = \frac{\sum_{i=1}^{N} \mu_i y_i^j}{\sum_{i=1}^{N} \mu_i}, \qquad \beta^j = \frac{\sum_{i=1}^{N} (1 - \mu_i)(1 - y_i^j)}{\sum_{i=1}^{N} (1 - \mu_i)}$$
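The E-step and M-step can be sketched end to end in plain Python. To stay short, this sketch makes a simplification that is not the full algorithm on the slides: it drops the logistic-regression classifier and replaces every $p_i$ by a single prevalence parameter `p`, updated as the mean of the $\mu_i$. The function name, the clipping constant `eps`, and all simulated values are assumptions for illustration.

```python
import random


def em_without_features(Y, n_iter=50, eps=1e-6):
    """EM for the two-coin model with a prevalence p instead of a classifier.

    Y: list of N rows, each holding R binary annotator labels.
    Returns (mu, alpha, beta, p).
    """
    N, R = len(Y), len(Y[0])

    def clip(v):
        return min(max(v, eps), 1.0 - eps)

    # Initialise the soft labels with the majority-vote fraction.
    mu = [sum(row) / R for row in Y]
    for _ in range(n_iter):
        # M-step: closed-form updates from the current soft labels.
        pos = max(sum(mu), eps)
        neg = max(N - sum(mu), eps)
        alpha = [clip(sum(mu[i] * Y[i][j] for i in range(N)) / pos)
                 for j in range(R)]
        beta = [clip(sum((1 - mu[i]) * (1 - Y[i][j]) for i in range(N)) / neg)
                for j in range(R)]
        p = clip(sum(mu) / N)
        # E-step: posterior probability that each true label is 1.
        for i in range(N):
            a_i, b_i = 1.0, 1.0
            for j in range(R):
                a_i *= alpha[j] if Y[i][j] == 1 else 1 - alpha[j]
                b_i *= beta[j] if Y[i][j] == 0 else 1 - beta[j]
            mu[i] = a_i * p / (a_i * p + b_i * (1 - p))
    return mu, alpha, beta, p


# Simulated crowd: hypothetical sensitivities and specificities.
rng = random.Random(0)
true_alpha = [0.90, 0.80, 0.60, 0.60, 0.55]
true_beta = [0.95, 0.85, 0.60, 0.65, 0.60]
truth = [1 if rng.random() < 0.5 else 0 for _ in range(3000)]
Y = []
for y in truth:
    row = []
    for a, b in zip(true_alpha, true_beta):
        keep = rng.random() < (a if y == 1 else b)
        row.append(y if keep else 1 - y)
    Y.append(row)

mu, alpha, beta, p = em_without_features(Y)
accuracy = sum((m > 0.5) == (y == 1) for m, y in zip(mu, truth)) / len(truth)
print(alpha, beta, accuracy)
```

Putting a classifier back in would mean computing `p` per instance from the features and adding an update for `w` to the M-step.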

29 Expectation Maximization
- The EM algorithm is only guaranteed to converge to a local maximum
- Multiple restarts with different initializations
  - use majority voting as the initialization for $\mu_i$

30 Beta Prior
- Trust a particular expert more than the others
  - impose priors on the sensitivity and specificity of the experts
- Beta prior
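One way to see what the prior changes: with Beta priors, the closed-form M-step updates become MAP estimates. The derivation below is a sketch under stated assumptions. The hyperparameter names $a_1^j, a_2^j, b_1^j, b_2^j$ are illustrative, with $\alpha^j \sim \text{Beta}(a_1^j, a_2^j)$ and $\beta^j \sim \text{Beta}(b_1^j, b_2^j)$.

```latex
% Terms of the expected log-posterior that involve \alpha^j:
Q(\alpha^j) = \sum_{i=1}^{N} \mu_i \left[ y_i^j \ln \alpha^j + (1 - y_i^j) \ln (1 - \alpha^j) \right]
            + (a_1^j - 1) \ln \alpha^j + (a_2^j - 1) \ln (1 - \alpha^j)

% Setting dQ / d\alpha^j = 0 gives the MAP update; \beta^j follows by symmetry:
\alpha^j = \frac{a_1^j - 1 + \sum_{i=1}^{N} \mu_i y_i^j}
                {a_1^j + a_2^j - 2 + \sum_{i=1}^{N} \mu_i},
\qquad
\beta^j = \frac{b_1^j - 1 + \sum_{i=1}^{N} (1 - \mu_i)(1 - y_i^j)}
               {b_1^j + b_2^j - 2 + \sum_{i=1}^{N} (1 - \mu_i)}
```

With uniform priors ($a_1^j = a_2^j = b_1^j = b_2^j = 1$) these reduce to the maximum-likelihood updates; larger hyperparameters act like pseudo-counts that pull an expert toward the prior belief.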

31 Discussions
- Estimate of the gold standard (true label)
  - soft probabilistic estimate: $\mu_i = \Pr[y_i = 1 \mid y_i^1, \dots, y_i^R, x_i, \theta]$
  - $y_i = 1$ if $\mu_i$ exceeds a threshold (chosen for high sensitivity or specificity)
- The estimated ground truth is a weighted linear combination of the labels from all the experts
- The proposed algorithm can be used with any classifier that can be trained with soft probabilistic labels
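The "weighted linear combination" claim can be made explicit by taking the logit of the E-step posterior. The short derivation below is a sketch that assumes $p_i = \sigma(w^\top x_i)$ and the products $a_i$, $b_i$ from the likelihood slides; $\text{logit}(q) = \ln \frac{q}{1 - q}$.

```latex
\mu_i = \frac{a_i p_i}{a_i p_i + b_i (1 - p_i)}
\;\Longrightarrow\;
\text{logit}(\mu_i) = \ln \frac{a_i}{b_i} + \text{logit}(p_i)

\ln \frac{a_i}{b_i}
  = \sum_{j=1}^{R} y_i^j \left[ \text{logit}(\alpha^j) + \text{logit}(\beta^j) \right]
  + \sum_{j=1}^{R} \ln \frac{1 - \alpha^j}{\beta^j}

\text{logit}(\mu_i)
  = w^\top x_i
  + \sum_{j=1}^{R} y_i^j \left[ \text{logit}(\alpha^j) + \text{logit}(\beta^j) \right]
  + \sum_{j=1}^{R} \ln \frac{1 - \alpha^j}{\beta^j}
```

Each label $y_i^j$ is weighted by $\text{logit}(\alpha^j) + \text{logit}(\beta^j)$, so an annotator with high sensitivity and specificity gets a large weight, while one who labels at chance ($\alpha^j = \beta^j = 0.5$) gets weight zero.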

32 Discussions
- Estimate the ground truth without features
- Handle missing labels from some experts
- Use probabilistic scores to evaluate classifiers
- Posterior approximation $\mu_i$

33 Multi-class classification
Categorical data ($K \ge 2$)
- suppose the radiologist needs to label whether a nodule is a solid, a part-solid, or a ground-glass opacity

$$\alpha_c^j = (\alpha_{c1}^j, \dots, \alpha_{cK}^j), \qquad \alpha_{ck}^j := \Pr[y^j = k \mid y = c], \qquad \sum_{k=1}^{K} \alpha_{ck}^j = 1$$

$\alpha_{ck}^j$: probability that annotator $j$ assigns class $k$ to an instance given that the true class is $c$

In the binary case, sensitivity: $\alpha^j = \alpha_{11}^j$; specificity: $\beta^j = \alpha_{00}^j$

34 Ordinal Regression
Categorical data with an ordering among the labels
- suppose the radiologist needs to give a score (from 1 to 5) to indicate how likely she thinks the region is malignant

Each ordinal label is turned into $K - 1$ binary labels:

$$y_i^{jc} = \begin{cases} 1 & \text{if } y_i^j > c \\ 0 & \text{otherwise} \end{cases} \qquad c = 1, \dots, K - 1$$

$y_i^j$: the label assigned to the $i$-th instance by the $j$-th annotator

$$\Pr[y_i = c] = \Pr[y_i > c - 1] - \Pr[y_i > c]$$
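The reduction from ordinal labels to binary problems, and the step back to class probabilities, can be sketched as follows. The function names and the example probabilities are made up for illustration; the code assumes labels in 1..K and uses the boundary values Pr[y > 0] = 1 and Pr[y > K] = 0.

```python
def binarize(label, K):
    """Turn an ordinal label in 1..K into K-1 binary labels [label > c]."""
    return [1 if label > c else 0 for c in range(1, K)]


def class_probs(exceed):
    """Recover Pr[y = c] for c = 1..K from exceed[c-1] = Pr[y > c], c = 1..K-1."""
    padded = [1.0] + list(exceed) + [0.0]  # Pr[y > 0] = 1, Pr[y > K] = 0
    return [padded[c - 1] - padded[c] for c in range(1, len(padded))]


# A score of 3 on a 1-to-5 scale becomes four binary answers.
binary_labels = binarize(3, K=5)

# Hypothetical outputs of the four binary estimators for one instance.
probs = class_probs([0.9, 0.6, 0.3, 0.1])
print(binary_labels, probs)
```

Each of the K-1 binary problems can then be handled with the binary two-coin machinery, which is what makes the reduction convenient.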

35 Experiments
- Classification experiments
- Regression experiments

36 Classification Experiments
Two CAD data sets and one text data set:
- digital mammography data (simulated radiologists)
- a breast MRI data set (four real radiologists)
- Recognizing Textual Entailment data (MTurk, 164 annotators)

37 Digital Mammography
- Use a biopsy-proven data set ([…] positive, 1618 negative examples)
  - each instance: a set of 27 morphological features
  - simulate multiple radiologists based on the two-coin model
- Simultaneously learn:
  - a logistic-regression classifier
  - the sensitivity and specificity of each radiologist
  - the golden ground truth

38 Result Analysis
- How good is the learnt classifier?
- How well can we estimate the specificity of each radiologist?
- How good is the estimated ground truth?

39 Receiver Operating Characteristic (ROC)

40 Digital Mammography Result
5 radiologists: 2 experts, 3 novices
$\alpha = [0.90, 0.80, 0.57, 0.60, 0.55]$
$\beta = [0.95, 0.85, 0.62, 0.65, 0.58]$

41 Digital Mammography Result
8 radiologists: 1 expert, 7 novices

42 Digital Mammography Result
Decoupled estimation: first estimate the ground truth, and then learn the classifier

43 Digital Mammography Result
Joint estimation:
- Classifier performance
- Radiologist performance
- Estimate of actual ground truth

44 Takeaway
- A probabilistic framework for supervised learning with multiple annotators but no gold standard
- The algorithm iteratively
  - establishes a gold standard
  - measures the annotator performance
  - refines the gold standard

45 Discussion
Does this method have any problems?

