Active Learning from Single and Multiple Annotators

Size: px

Start display at page:

Download "Active Learning from Single and Multiple Annotators"

Augusta Cannon
5 years ago
Views:

1 Active Learning from Single and Multiple Annotators Kamalika Chaudhuri University of California, San Diego Joint work with Chicheng Zhang

2 Classification Given: ( xi, yi ) Vector of features Discrete Labels Find: Prediction rule in a class to predict y from x

3 Challenge: Acquiring Labeled Data Unlabeled data is cheap Labels are expensive

4 Active Learning Given: ( xi, yi ) Find: Prediction rule to predict y from x

5 Active Learning Given: ( xi, yi ) Interactive Label Queries Find: Prediction rule to predict y from x

6 Active Learning Given: ( xi, yi ) Interactive Label Queries Find: Prediction rule to predict y from x using few label queries

7 Why Active Learning Helps? Given: Unlabeled data, interactive label queries Find: Good prediction rule using few label queries

8 Why Active Learning Helps? Given: Unlabeled data, interactive label queries Find: Good prediction rule using few label queries

9 Why Active Learning Helps? Given: Unlabeled data, interactive label queries Find: Good prediction rule using few label queries

10 Why Active Learning Helps? Given: Unlabeled data, interactive label queries Find: Good prediction rule using few label queries

11 Why Active Learning Helps? Given: Unlabeled data, interactive label queries Find: Good prediction rule using few label queries

12 Challenge: Agnostic Active Learning

13 PAC Model: Realizable vs. Agnostic Given: Find: Concept class C, Samples (xi, yi) from D c in C with low error c 2 C Realizable such that c (x) =y, 8(x, y) D Agnostic No Assumptions on D -

14 Agnostic Active Learning Given: Concept class C ( best c in C has error ) ( xi, yi ) drawn from D Interactive Label Queries Find: c in C with error apple + using few label queries with no assumptions on D

15 Why is Agnostic Active Learning hard? Given: Unlabeled data, interactive label queries No assumptions on data distribution Find: Good prediction rule using few label queries

16 Why is Agnostic Active Learning hard? Given: Unlabeled data, interactive label queries No assumptions on data distribution Find: Good prediction rule using few label queries

17 Why is Agnostic Active Learning hard? Given: Unlabeled data, interactive label queries No assumptions on data distribution Find: Good prediction rule using few label queries Statistically Inconsistent: finds local minimum!

18 Talk Outline What is Active Learning? Active Learning from Single Annotator [NIPS 14] Active Learning from Multiple Annotators [NIPS 15]

19 Active Learning from a Single Annotator

20 Previous Work: Disagreement-based Active Learning 1. Maintain candidate set V (that contains best c in C) 2. For unlabeled x, if there exist c1, c2 in V s.t c 1 (x) 6= c 2 (x) then, query label of x, and update V c1 c2 [CAL94, BBL06, DHM07, H07, BDL09, BHLZ10, ] V x

21 Previous Work: Disagreement-based Active Learning 1. Maintain candidate set V (that contains best c in C) 2. For unlabeled x, if there exist c1, c2 in V s.t c 1 (x) 6= c 2 (x) then, query label of x, and update V c1 c2 Pro: Very general V Con: High label requirement x

22 Previous Work: Margin-based Active Learning Geometric algorithm for linear classification wrt uniform and log-concave distribution on sphere [BZ07, BL13, ABL14] Pro: Label requirement Con: Not general

23 Is there a general algorithm with better label complexity?

24 Talk Outline What is Active Learning? Active Learning from Single Annotator [NIPS 14] Previous Work Confidence Based Active Learning

25 Confidence-Rated Predictor (CRP) o ++ + o o o - o + Classifiers that can Abstain

26 Confidence-Rated Predictor (CRP) o ++ + o o o - o + Classifiers that can Abstain Performance Metrics: Error: Rate of mistakes Abstention Rate: Rate of Don t Know predictions

27 CRP with Guaranteed Error [EW10] - o + V - o o o V o Input: Concept set V Unlabeled data U Error guarantee

28 CRP with Guaranteed Error [EW10] - o + V - o o o V o Input: Concept set V Unlabeled data U Error guarantee Goal: Assign label P(z) in {+1, -1, 0} to each z in U Guarantee: For all c in V, Pr (P (z) = z U c(z)) apple

29 Confidence-Based Active Learning Two Key Ideas: 1. A reduction from Active Learning to CRP with guaranteed error! 2. Construction of a CRP with guaranteed error

30 Algorithm Outline 1. Active Learning via CRP 2. How to construct a CRP with guaranteed error

31 Active Learning via CRP Given: Unlabeled data U Concept class C Confidence-rate Predictor P Target excess error Goal: Output c in C with target excess error with min #label queries Problem: Which labels to query?

32 Definition: Confidence Sets for c* Recall: c* is the concept in C with min error Confidence Set: Set S of concepts s.t. w.p 1, c* is in S c* S Similar to confidence intervals

33 How to construct confidence sets? Use generalization bounds: Known: 8c,w.p. 1, err(c) cerr(c) apple r VC(C) log(n/ ) n Choose: All c with: cerr(c) apple min c 0 2C cerr(c0 )+ r VC(C) log(n/ ) n (A bit more complicated for active learning)

34 Active Learning via CRP Given: Unlabeled data U Concept class C Confidence-rate Predictor P Target error 1. Initialize: Confidence set for c* V1 = C, d = VCdim(C). 2. For epoch k = 1, 2, 3,. (a) Target error for epoch k: k = 1 2 k Problem: How to query labels?

35 Active Learning via CRP Given: Unlabeled data U Concept class C Confidence-rate Predictor P Target error 1. Initialize: Confidence set for c* V1 = C, d = VCdim(C). 2. For epoch k = 1, 2, 3,. (a) Target error for epoch k: k = 1 2 k Key Idea 1: Run P with error guarantee when P abstains k /64, and query label

36 Active Learning via CRP Given: Unlabeled data U Concept class C Confidence-rate Predictor P Target error 1. Initialize: Confidence set for c* V1 = C, d = VCdim(C). 2. For epoch k = 1, 2, 3,. (a) Target error for epoch k: k = 1 2 k Key Idea 1: Run P with error guarantee when P abstains k /64, and query label Key Idea 2: Query just enough labels to get target excess error k+1 / k on the Don t Knows ( = P s abstention prob.) k

37 Active Learning via CRP Given: Unlabeled data U Concept class C Confidence-rate Predictor P Target error 1. Initialize: Confidence set for c* V1 = C, d = VCdim(C). 2. For epoch k = 1, 2, 3,. (a) Target error for epoch k: k = 1 2 k Key Idea 1: Run P with error guarantee when P abstains k /64, and query label Key Idea 2: Query just enough labels to get target excess error k+1 / k on the Don t Knows ( = P s abstention prob.) Low abstention prob. means less labels queried k

38 Active Learning via CRP Given: Unlabeled data U Concept class C Confidence-rate Predictor P Target error 1. Initialize: Confidence set for c* V1 = C, d = VCdim(C). 2. For epoch k = 1, 2, 3,. (a) Target error for epoch k: k = 1 2 k (b) Run CRP P with V = Vk, unlabeled data, error guarantee = k /64

39 Active Learning via CRP Given: Unlabeled data U Concept class C Confidence-rate Predictor P Target error 1. Initialize: Confidence set for c* V1 = C, d = VCdim(C). 2. For epoch k = 1, 2, 3,. (a) Target error for epoch k: k = 1 2 k (b) Run CRP P with V = Vk, unlabeled data, error guarantee = k /64 (c) Whenever P abstains, query enough labels to get excess error k / k on the Don t Knows ( is P s abstention rate). k

40 Active Learning via CRP Given: Unlabeled data U Concept class C Confidence-rate Predictor P Target error 1. Initialize: Confidence set for c* V1 = C, d = VCdim(C). 2. For epoch k = 1, 2, 3,. (a) Target error for epoch k: k = 1 2 k (b) Run CRP P with V = Vk, unlabeled data, error guarantee = k /64 (c) Whenever P abstains, query enough labels to get excess error k / k on the Don t Knows ( is P s abstention rate). (d) Update: Confidence set Vk+1 (based on new labeled samples) k

41 Active Learning via CRP Given: Unlabeled data U Concept class C Confidence-rate Predictor P Target error 1. Initialize: Confidence set for c* V1 = C, d = VCdim(C). 2. For epoch k = 1, 2, 3,. (a) Target error for epoch k: k = 1 2 k (b) Run CRP P with V = Vk, unlabeled data, error guarantee = k /64 (c) Whenever P abstains, query enough labels to get excess error k / k on the Don t Knows ( is P s abstention rate). (d) Update: Confidence set Vk+1 (based on new labeled samples) 3. Output any c in V log(1/ ) k

42 Algorithm Outline 1. Active Learning via CRP 2. How to construct a CRP with guaranteed error

43 CRP with Guaranteed Error [EW10] - o + V - o o o V o Input: Concept set V Unlabeled data U Error guarantee Goal: Assign label P(z) in {+1, -1, 0} to each z in U Guarantee: For all c in V, Pr (P (z) = z U c(z)) apple

44 CRP with Guaranteed Error Given: Concept set V Error guarantee Unlabeled data U

45 CRP with Guaranteed Error Given: Concept set V Error guarantee Unlabeled data U 1. Construct and solve Linear Program: For each zi in U, variables: i : Prob. of predicting +1 on z i i : Prob. of predicting -1 on zi

46 CRP with Guaranteed Error Given: Concept set V Error guarantee Unlabeled data U 1. Construct and solve Linear Program: For each zi in U, variables: i : Prob. of predicting +1 on z i i : Prob. of predicting -1 on zi max X i i + i U * Non-abstention rate

47 CRP with Guaranteed Error Given: Concept set V Error guarantee Unlabeled data U 1. Construct and solve Linear Program: For each zi in U, variables: i : Prob. of predicting +1 on z i i : Prob. of predicting -1 on zi s.t: max X i 8c 2 V, i + X i i:c(z i )=1 i + X i:c(z i )= 1 i apple U U * Non-abstention rate Error guarantee

48 CRP with Guaranteed Error Given: Concept set V Error guarantee Unlabeled data U 1. Construct and solve Linear Program: For each zi in U, variables: i : Prob. of predicting +1 on z i i : Prob. of predicting -1 on zi s.t: max X i 8c 2 V, i + X i i + i:c(z i )=1 i:c(z i )= 1 8i, i + i apple 1, i, i 0 X i apple U U * Non-abstention rate Error guarantee Probability constraint

49 CRP with Guaranteed Error Given: Concept set V Error guarantee Unlabeled data U 1. Construct and solve Linear Program: s.t: max X i 8c 2 V, i + X i i + 8i, i + i apple 1, i, i 0 X i:c(z i )=1 i:c(z i )= 1 i apple U U * Non-abstention rate Error guarantee Probability constraint 2. For zi in U, output: 1 w.p. i, 1 w.p. i, 0ow Note: Optimal abstention rate given concept set V

50 Algorithm Outline 1. Active Learning via CRP 2. How to construct a CRP with guaranteed error

51 Talk Outline What is Active Learning? Active Learning from Single Annotator [NIPS 14] Previous Work Confidence Based Active Learning Performance Guarantees

52 Consistency Guarantees Theorem: If P is a CRP with guaranteed error, then our active learning algorithm is consistent. Note: Any CRP with guaranteed error gives consistency.

53 Label Complexity: Agnostic Case Parameters: Target error VC dimension d Error of best c in C

54 Label Complexity: Agnostic Case Parameters: Target error VC dimension d Error of best c in C Disagreement-based: (2 + )( d( ) 2 log(1/ ) 2 + d log 2 (1/ ))

55 Label Complexity: Agnostic Case Parameters: Target error VC dimension d Error of best c in C Disagreement-based: (2 + )( d( ) 2 log(1/ ) Rate of change of disagreement region 2 + d log 2 (1/ ))

56 Label Complexity: Agnostic Case Parameters: Target error VC dimension d Error of best c in C Disagreement-based: (2 + )( d( ) 2 log(1/ ) Rate of change of disagreement region 2 + d log 2 (1/ )) Our Algorithm: (2 +, /256)( d( ) 2 log(1/ ) 2 + d log 2 (1/ ))

57 Label Complexity: Agnostic Case Parameters: Target error VC dimension d Error of best c in C Disagreement-based: (2 + )( d( ) 2 log(1/ ) 2 Rate of change of disagreement region + d log 2 (1/ )) Our Algorithm: (2 +, /256)( d( ) 2 log(1/ ) Rate of change of DK region Lower than (2 + ) 2 + d log 2 (1/ ))

58 Label Complexity: Agnostic Case Parameters: Target error VC dimension d Error of best c in C Disagreement-based: (2 + )( d( ) 2 log(1/ ) 2 Rate of change of disagreement region + d log 2 (1/ )) Our Algorithm: (2 +, /256)( d( ) 2 log(1/ ) Rate of change of DK region Lower than (2 + ) 2 + d log 2 (1/ )) Passive Learning: d log(1/ ) 2

59 Linear Classification on Logconcave Distribution on Sphere Our Algorithm: (Realizable Case) d log(1/ ) + log(1/ ) log log(1/ ) (matches Margin-based Active Learning [BL13]) Disagreement-based Active Learning: d 3/2 log 2 (1/ )(log d + log log(1/ ))

60 Conclusions Novel reduction from active learning to CRP Exploits non-zero error guarantee for better label reqmt Novel active learning algorithm: Better asymptotic label complexity than disagreement based More general than margin based

61 Talk Outline What is Active Learning? Active Learning from Single Annotator [NIPS 14] Previous Work Confidence Based Active Learning Performance Guarantees Active Learning from Multiple Annotators [NIPS 15]

62 What if we have auxiliary information?..as an extra oracle

63 Oracle and Weak Labeler Oracle: expensive but correct Weak labeler: cheap, sometimes wrong

64 The Model Given: ( xi, yi ) Interactive Label Queries

65 The Model Given: ( xi, yi ) Interactive Label Queries to Oracle O or Weak Labeler W

66 The Model Given: ( xi, yi ) Interactive Label Queries to Oracle O or Weak Labeler W Find: Prediction rule to predict y from x using few label queries to O

67 Formal Model Given: Concept class C (best c has error wrt O) ( xi, yi ) drawn from D

68 Formal Model Given: Concept class C (best c has error wrt O) ( xi, yi ) drawn from D Oracle O and Weak labeler W

69 Formal Model Given: Concept class C (best c has error wrt O) ( xi, yi ) drawn from D Oracle O or Weak labeler W Find: c in C with error apple + wrt O using minimum label queries to O

70 Formal Model Implications Weak labeler W may be biased yo = -1 yw = -1 yo = 1 yw = 1 Labels by O Labels by W

71 Previous Work [UBS12] Explicit assumptions on where W and O differ, provides algorithms in the non-parametric case [MCR14] No explicit assumptions, but applies to online selective linear classification and robust regression This talk: General learning strategy from W and O with no explicit assumptions

72 Talk Outline What is Active Learning? Active Learning from Single Annotator [NIPS 14] Active Learning from Multiple Annotators [NIPS 15] The Model Algorithm

73 How to learn in this model? Main Ideas: Learn a difference classifier h to predict when O and W differ

74 How to learn in this model? Main Ideas: Learn a difference classifier h to predict when O and W differ Use h with standard active learning to decide if we should query O or W

75 Algorithm Outline 1. Draw x1,..,xm. For each xi, query O and W. Set: yi,d = 1 if yi,o = yi,w

76 Algorithm Outline 1. Draw x1,..,xm. For each xi, query O and W. Set: yi,d = 1 if yi,o = yi,w 2. Train difference classifier h in H on { (xi, yi,d) }

77 Algorithm Outline 1. Draw x1,..,xm. For each xi, query O and W. Set: yi,d = 1 if yi,o = yi,w 2. Train difference classifier h in H on { (xi, yi,d) } 3. Run standard disagreement based active learning algorithm A. If A queries the label of x then: if h(x) = 1, query O, else query W

78 Algorithm Outline 1. Draw x1,..,xm. For each xi, query O and W. Set: yi,d = 1 if yi,o = yi,w 2. Train difference classifier h in H on { (xi, yi,d) } 3. Run standard disagreement based active learning algorithm A. If A queries the label of x then: if h(x) = 1, query O, else query W Is this statistically consistent?

79 Key Observation 1 Directly learning difference classifier may lead to inconsistent annotation on target task yo = 1 Actual Labels

80 Key Observation 1 Directly learning difference classifier may lead to inconsistent annotation on target task yw = -1 yo = 1 Actual Labels

81 Key Observation 1 Directly learning difference classifier may lead to inconsistent annotation on target task yw = -1 h* yo = 1 Actual Labels

82 Key Observation 1 Directly learning difference classifier may lead to inconsistent annotation on target task yw = -1 Query O h* h* Query W yo = 1 Actual Labels Annotation using h* as difference classifier

83 Key Observation 1 Directly learning difference classifier may lead to inconsistent annotation on target task yw = -1 y = -1 h* yo = 1 y = 1 Actual Labels Annotation using h* as difference classifier

84 Solution Train a cost-sensitive difference classifier Constrain False Negative (FN) rate as very low yw = -1 h*fn yo = 1 Actual Labels

85 Solution Train a cost-sensitive difference classifier Constrain False Negative (FN) rate as very low yw = -1 Query O h*fn h*fn yo = 1 Query W Actual Labels Annotation using h*fn as difference classifier

86 Solution Train a cost-sensitive difference classifier Constrain False Negative (FN) rate as very low yw = -1 h*fn h*fn yo = 1 y = 1 Actual Labels Annotation using h*fn as difference classifier

87 Algorithm Outline 1. Draw x1,..,xm. For each xi, query O and W. Set: yi,d = 1 if yi,o = yi,w 2. Train difference classifier h in H on { (xi, yi,d) } with false negative (FN) rate apple 3. Run standard disagreement based active learning algorithm A. If A queries the label of x then: if h(x) = 1, query O, else query W Theorem: This is statistically consistent

88 What about label complexity? Label complexity = #label queries to O

89 d 0 #labels to train difference classifier Õ (d = VCdim(H), = target excess error) Can we do better?

90 Key Observation 2 R = disagreement region of current confidence set R Input space

91 Key Observation 2 R = disagreement region of current confidence set Need to learn difference classifier with FN rate R apple / Pr(R) over R Input space

92 Key Observation 2 R = disagreement region of current confidence set Need to learn difference classifier with FN rate R apple / Pr(R) over R Need Õ d 0 Pr(R) labels Input space

93 Key Observation 2 R = disagreement region of current confidence set Need to learn difference classifier with FN rate R apple / Pr(R) over R Need Õ d 0 Pr(R) labels Problem: R keeps changing. Input space

94 Full Algorithm H = difference concept class, d = VCdim(H), = disagreemt coeff For epochs 1, 2, 3,. Target excess error in epoch k k 1/2 k

95 Full Algorithm H = difference concept class, d = VCdim(H), = disagreemt coeff For epochs 1, 2, 3,. Target excess error in epoch k k 1/2 k Õ(d 0 ( + k )/ k ) Draw samples x1,..,xm. For each xi, query O and W. Set: yi,d = 1 if yi,o = yi,w

96 Full Algorithm H = difference concept class, d = VCdim(H), = disagreemt coeff For epochs 1, 2, 3,. Target excess error in epoch k k 1/2 k Õ(d 0 ( + k )/ k ) Draw samples x1,..,xm. For each xi, query O and W. Set: yi,d = 1 if yi,o = yi,w Train difference classifier h on { (xi, yi,d) } with FN rate k / ( + k ) over the current disagreemt region

97 Full Algorithm H = difference concept class, d = VCdim(H), = disagreemt coeff For epochs 1, 2, 3,. Target excess error in epoch k k 1/2 k Õ(d 0 ( + k )/ k ) Draw samples x1,..,xm. For each xi, query O and W. Set: yi,d = 1 if yi,o = yi,w Train difference classifier h on { (xi, yi,d) } with FN rate k / ( + k ) over the current disagreemt region Run disagreement based active learning algorithm A to target excess error k. If A queries the label of x then: if h(x) = 1, query O, else query W

98 Label Complexity Total #labels to train difference classifier Õ d 0 ( + ) How many labels for the rest of active learning?

99 Label Complexity: Assumptions For any r, t, there is a h in H such that: Pr(h(x) = 1,x2 DIS(B(c,r),y O 6= y W ) apple t (Low FN over disagreemt region) Pr(h(x) =1,x2 DIS(B(c,r)) apple (r, t) (Low positives) Note: (r, t) apple Pr(DIS(B(c,r))

100 Label Complexity #labels to train difference classifier Õ #labels for active learning Õ where: d ( ) 2 2 d 0 ( + ) (2 +,O( )) 2 + apple Compare: #labels for disagreemt based active learning: Õ d ( ) 2 2

101 Conclusion Auxiliary oracle may help if there is a good difference classifier in the difference concept class Open Question: What if there is no gold standard?

102 Thank You!

103

Selective Prediction. Binary classifications. Rong Zhou November 8, 2017

Selective Prediction. Binary classifications. Rong Zhou November 8, 2017 Selective Prediction Binary classifications Rong Zhou November 8, 2017 Table of contents 1. What are selective classifiers? 2. The Realizable Setting 3. The Noisy Setting 1 What are selective classifiers?