Learning from Corrupted Binary Labels via Class-Probability Estimation


1 Learning from Corrupted Binary Labels via Class-Probability Estimation. Aditya Krishna Menon, Brendan van Rooyen, Cheng Soon Ong, Robert C. Williamson. National ICT Australia and The Australian National University

2 Learning from binary labels. [Figure: instances labelled positive (+) and negative (−).]

3 Learning from binary labels. [Figure: labelled instances together with a new, unlabelled instance (?) to classify.]

4 Learning from binary labels. [Figure: instances labelled positive (+) and negative (−).]

5 Learning from noisy labels. [Figure: labelled instances, some with flipped labels.]

6 Learning from positive and unlabelled data. [Figure: a few positive (+) instances among unlabelled (?) instances.]

7 Learning from binary labels. [Pipeline: nature → D, learner receives S ~ D^n.] Goal: good classification wrt distribution D

8 Learning from corrupted labels. [Pipeline: nature → D → corruptor → D̄, learner receives S̄ ~ D̄^n.] Goal: good classification wrt (unobserved) distribution D

9 Paper summary. Can we learn a good classifier from corrupted samples?

10 Paper summary. Can we learn a good classifier from corrupted samples? Prior work: in special cases (with a rich enough model), yes!

11 Paper summary. Can we learn a good classifier from corrupted samples? Prior work: in special cases (with a rich enough model), yes! Can treat samples as if uncorrupted! (Elkan and Noto, 2008), (Zhang and Lee, 2008), (Natarajan et al., 2013), (du Plessis and Sugiyama, 2014), ...

12 Paper summary. Can we learn a good classifier from corrupted samples? Prior work: in special cases (with a rich enough model), yes! Can treat samples as if uncorrupted! (Elkan and Noto, 2008), (Zhang and Lee, 2008), (Natarajan et al., 2013), (du Plessis and Sugiyama, 2014), ... This work: unified treatment via class-probability estimation; analysis for a general class of corruptions

13 Assumed corruption model

14 Learning from binary labels: distributions. Fix instance space X (e.g. R^N). Underlying distribution D over X × {±1}. Constituent components of D: (P(x), Q(x), p) = (P[X = x | Y = 1], P[X = x | Y = −1], P[Y = 1])

15 Learning from binary labels: distributions. Fix instance space X (e.g. R^N). Underlying distribution D over X × {±1}. Constituent components of D: (P(x), Q(x), p) = (P[X = x | Y = 1], P[X = x | Y = −1], P[Y = 1]); equivalently, (M(x), h(x)) = (P[X = x], P[Y = 1 | X = x])

16 Learning from corrupted binary labels. [Pipeline: nature → D → corruptor → D̄, learner receives S̄ ~ D̄^n.] Samples from corrupted distribution D̄ = (P̄, Q̄, p̄). Goal: good classification wrt (unobserved) distribution D

17 Learning from corrupted binary labels. [Pipeline: nature → D → corruptor → D̄, learner receives S̄ ~ D̄^n.] Samples from corrupted distribution D̄ = (P̄, Q̄, p̄), where P̄ = (1 − a)·P + a·Q, Q̄ = b·P + (1 − b)·Q, and p̄ is arbitrary; a, b are the noise rates → mutually contaminated distributions (Scott et al., 2013). Goal: good classification wrt (unobserved) distribution D

18 Special cases. Label noise: labels flipped w.p. r; p̄ = (1 − 2r)·p + r, a = (1 − p)·r / p̄, b = p·r / (1 − p̄). PU learning: observe M instead of Q; p̄ arbitrary, P̄ = 1·P + 0·Q, Q̄ = M = p·P + (1 − p)·Q
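To make the two special cases concrete, here is a minimal NumPy sketch (not from the slides; the helper names inject_label_noise and make_pu_sample are illustrative) of how a corruptor could generate samples matching each case: symmetric label noise flips each label w.p. r, while case-control PU learning draws "positives" from P and "unlabelled" points from the marginal M = p·P + (1 − p)·Q, so that a = 0 and b = p.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_label_noise(y, r):
    """Symmetric label noise: flip each label in {-1, +1} independently w.p. r."""
    flips = rng.random(len(y)) < r
    return np.where(flips, -y, y)

def make_pu_sample(X, y, n_pos, n_unl):
    """Case-control PU data: 'positives' drawn from P (true positives, labelled +1),
    'negatives' drawn from the marginal M = p*P + (1-p)*Q (approximated here by
    resampling the whole clean sample) and labelled -1. Hence a = 0, b = p."""
    pos_idx = rng.choice(np.flatnonzero(y == 1), size=n_pos, replace=True)
    unl_idx = rng.choice(len(y), size=n_unl, replace=True)
    X_corr = np.vstack([X[pos_idx], X[unl_idx]])
    y_corr = np.concatenate([np.ones(n_pos), -np.ones(n_unl)])
    return X_corr, y_corr
```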

19 Corrupted class-probabilities. Structure of corrupted class-probabilities underpins the analysis

20 Corrupted class-probabilities. Structure of corrupted class-probabilities underpins the analysis. Proposition: for any D, D̄, h̄(x) = f_{a,b,p}(h(x)), where f_{a,b,p} is strictly monotone for fixed a, b, p.

21 Corrupted class-probabilities. Structure of corrupted class-probabilities underpins the analysis. Proposition: for any D, D̄, h̄(x) = f_{a,b,p}(h(x)), where f_{a,b,p} is strictly monotone for fixed a, b, p. Follows from Bayes' rule: h̄(x) / (1 − h̄(x)) = (p̄ / (1 − p̄)) · P̄(x) / Q̄(x)

22 Corrupted class-probabilities. Structure of corrupted class-probabilities underpins the analysis. Proposition: for any D, D̄, h̄(x) = f_{a,b,p}(h(x)), where f_{a,b,p} is strictly monotone for fixed a, b, p. Follows from Bayes' rule: h̄(x) / (1 − h̄(x)) = (p̄ / (1 − p̄)) · P̄(x) / Q̄(x) = (p̄ / (1 − p̄)) · ((1 − a)·P(x)/Q(x) + a) / (b·P(x)/Q(x) + (1 − b))
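As a concrete illustration (a minimal sketch, not code from the paper), the odds identity above can be turned into an explicit map from h(x) to h̄(x); written this way the map needs the noise rates a, b, the clean base rate p and the corrupted base rate p̄. The final assertion checks it against the label-noise special case on the next slide.

```python
def corrupted_class_prob(eta, a, b, p, p_corr):
    """Map the clean class-probability eta = h(x) to the corrupted one h_bar(x),
    for noise rates (a, b), clean base rate p and corrupted base rate p_corr.
    Uses the odds identity above, with P(x)/Q(x) = (eta/(1-eta)) * (1-p)/p."""
    R = (eta / (1.0 - eta)) * ((1.0 - p) / p)            # likelihood ratio P(x)/Q(x)
    odds = (p_corr / (1.0 - p_corr)) * ((1.0 - a) * R + a) / (b * R + (1.0 - b))
    return odds / (1.0 + odds)

# Sanity check against the label-noise special case: with flip probability r,
# p_corr = (1-2r)p + r, a = (1-p)r/p_corr, b = pr/(1-p_corr), and the map
# should reduce to (1-2r)*eta + r.
p, r, eta = 0.3, 0.2, 0.7
p_corr = (1 - 2 * r) * p + r
a, b = (1 - p) * r / p_corr, p * r / (1 - p_corr)
assert abs(corrupted_class_prob(eta, a, b, p, p_corr) - ((1 - 2 * r) * eta + r)) < 1e-12
```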

23 Corrupted class-probabilities: special cases. Label noise: h̄(x) = (1 − 2r)·h(x) + r, with r unknown (Natarajan et al., 2013). PU learning: h̄(x) = p̄·h(x) / (p̄·h(x) + (1 − p̄)·p), with p unknown (Ward et al., 2009)

24 Roadmap. [Pipeline: nature → D → corruptor → D̄ → class-prob estimator (kernel logistic regression) → ĥ → classifier.]

25 Roadmap. Exploit the monotone relationship between h̄ and h. [Pipeline: nature → D → corruptor → D̄ → class-prob estimator (kernel logistic regression) → ĥ → classifier?]

26 Classification with noise rates

27 Class-probabilities and classification. Many classification measures are optimised by sign(h(x) − t): 0-1 error → t = 1/2; balanced error → t = p; F-score → optimal t depends on D (Lipton et al., 2014; Koyejo et al., 2014)

28 Class-probabilities and classification. Many classification measures are optimised by sign(h(x) − t): 0-1 error → t = 1/2; balanced error → t = p; F-score → optimal t depends on D (Lipton et al., 2014; Koyejo et al., 2014). We can relate this to thresholding of h̄!

29 Corrupted class-probabilities and classification. By the monotone relationship, h(x) > t ⟺ h̄(x) > f_{a,b,p}(t). Threshold h̄ at f_{a,b,p}(t) → optimal classification on D. Can translate into a regret bound, e.g. for 0-1 loss
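A hypothetical sketch of the resulting classification rule (names such as corrected_threshold are illustrative, and in practice a, b, p, p̄ would themselves be estimates): push the clean threshold t through f_{a,b,p}, then threshold the estimated corrupted class-probabilities at that value.

```python
import numpy as np

def corrected_threshold(t, a, b, p, p_corr):
    """f_{a,b,p}(t): the corrupted-space threshold corresponding to clean threshold t."""
    R = (t / (1.0 - t)) * ((1.0 - p) / p)       # likelihood ratio P(x)/Q(x) at h(x) = t
    odds = (p_corr / (1.0 - p_corr)) * ((1.0 - a) * R + a) / (b * R + (1.0 - b))
    return odds / (1.0 + odds)

def classify(eta_corr_hat, t, a, b, p, p_corr):
    """Threshold estimated corrupted class-probabilities to mimic sign(h(x) - t)."""
    return np.where(eta_corr_hat > corrected_threshold(t, a, b, p, p_corr), 1, -1)
```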

30 Story so far. Classification scheme requires: h̄; t; a, b, p. [Pipeline: nature → D → corruptor → D̄ → class-prob estimator → ĥ → classifier sign(ĥ(x) − f_{â,b̂,p̂}(t)), with a noise oracle supplying â, b̂, p̂.]

31 Story so far. Classification scheme requires: h̄ → class-probability estimation; t; a, b, p. [Pipeline: nature → D → corruptor → D̄ → class-prob estimator (kernel logistic regression) → ĥ → classifier sign(ĥ(x) − f_{â,b̂,p̂}(t)), with a noise oracle supplying â, b̂, p̂.]

32 Story so far. Classification scheme requires: h̄ → class-probability estimation; t → if unknown, alternate approach (see poster); a, b, p. [Pipeline: nature → D → corruptor → D̄ → class-prob estimator (kernel logistic regression) → ĥ → classifier sign(ĥ(x) − f_{â,b̂,p̂}(t)), with a noise oracle supplying â, b̂, p̂.]

33 Story so far. Classification scheme requires: h̄ → class-probability estimation; t → if unknown, alternate approach (see poster); a, b, p → can we estimate these? [Pipeline: nature → D → corruptor → D̄ → class-prob estimator (kernel logistic regression) → ĥ → classifier sign(ĥ(x) − f_{â,b̂,p̂}(t)), with a noise estimator (?) supplying â, b̂, p̂.]

34 Estimating noise rates: some bad news. p is strongly non-identifiable: p̄ is allowed to be arbitrary (e.g. PU learning). a, b are non-identifiable without assumptions (Scott et al., 2013). Can we estimate a, b under assumptions?

35 Weak separability assumption. Assume that D is weakly separable: min_{x∈X} h(x) = 0 and max_{x∈X} h(x) = 1, i.e. ∃ deterministically +ve and −ve instances; weaker than full separability

36 Weak separability assumption. Assume that D is weakly separable: min_{x∈X} h(x) = 0 and max_{x∈X} h(x) = 1, i.e. ∃ deterministically +ve and −ve instances; weaker than full separability. Assumed range of h constrains the observed range of h̄!

37 Estimating noise rates. Proposition: pick any weakly separable D. Then, for any D̄, a = h̄_min·(h̄_max − p̄) / (p̄·(h̄_max − h̄_min)) and b = (1 − h̄_max)·(p̄ − h̄_min) / ((1 − p̄)·(h̄_max − h̄_min)), where h̄_min = min_{x∈X} h̄(x) and h̄_max = max_{x∈X} h̄(x). So a, b can be estimated from corrupted data alone
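A plug-in version of the proposition (a sketch under the weak-separability assumption; in practice h̄_min and h̄_max would be replaced by the minimum and maximum, or suitable quantiles, of the estimated corrupted class-probabilities on a sample):

```python
def estimate_noise_rates(h_min, h_max, p_corr):
    """Plug-in estimates of (a, b) from the range of the corrupted class-probability
    and the corrupted base rate p_corr, assuming D is weakly separable."""
    spread = h_max - h_min
    a = h_min * (h_max - p_corr) / (p_corr * spread)
    b = (1.0 - h_max) * (p_corr - h_min) / ((1.0 - p_corr) * spread)
    return a, b

# e.g. with probs = estimated corrupted class-probabilities on the corrupted sample:
# a_hat, b_hat = estimate_noise_rates(probs.min(), probs.max(), (y_corr == 1).mean())
```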

38 Estimating noise rates: special cases. Label noise: r = h̄_min = 1 − h̄_max, p = (p̄ − h̄_min) / (h̄_max − h̄_min). PU learning: a = 0, b = p = p̄·(1 − h̄_max) / ((1 − p̄)·h̄_max) (Elkan and Noto, 2008), (Liu and Tao, 2014); c.f. the mixture proportion estimate of (Scott et al., 2013). In these cases, p can be estimated as well
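The special cases reduce to one-line estimators (again a sketch; the function names are illustrative, not from the paper):

```python
def label_noise_rate(h_min, h_max):
    """Symmetric label noise: r = h_bar_min = 1 - h_bar_max; average the two plug-in estimates."""
    return 0.5 * (h_min + (1.0 - h_max))

def label_noise_clean_prior(h_min, h_max, p_corr):
    """Symmetric label noise: p = (p_bar - h_bar_min) / (h_bar_max - h_bar_min)."""
    return (p_corr - h_min) / (h_max - h_min)

def pu_positive_prior(h_max, p_corr):
    """Case-control PU learning: p = p_bar * (1 - h_bar_max) / ((1 - p_bar) * h_bar_max)."""
    return p_corr * (1.0 - h_max) / ((1.0 - p_corr) * h_max)
```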

39 Story so far. Optimal classification in general requires a, b, p. [Pipeline: nature → D → corruptor → D̄ → class-prob estimator (kernel logistic regression) → ĥ → classifier sign(ĥ(x) − f_{â,b̂,p̂}(t)), with the noise estimator using the range of ĥ to supply â, b̂, p̂.]

40 Story so far. Optimal classification in general requires a, b, p; when does f_{a,b,p}(t) not depend on a, b, p? [Pipeline: nature → D → corruptor → D̄ → class-prob estimator (kernel logistic regression) → ĥ → classifier sign(ĥ(x) − f_{â,b̂,p̂}(t)), with the noise estimator using the range of ĥ to supply â, b̂, p̂.]

41 Classification without noise rates

42 Balanced error (BER) of a classifier. The balanced error (BER) of a classifier f : X → {±1} is BER_D(f) = (FPR_D(f) + FNR_D(f)) / 2, for false positive and false negative rates FPR_D(f), FNR_D(f): average classification performance on each class; the optimal classifier is sign(h(x) − p)

43 BER immunity under corruption. Proposition (c.f. Zhang and Lee, 2008): for any D, D̄, and classifier f : X → {±1}, BER_D̄(f) = (1 − a − b)·BER_D(f) + (a + b)/2

44 BER immunity under corruption. Proposition (c.f. Zhang and Lee, 2008): for any D, D̄, and classifier f : X → {±1}, BER_D̄(f) = (1 − a − b)·BER_D(f) + (a + b)/2. BER-optimal classifiers on the clean and corrupted distributions coincide: sign(h(x) − p) = sign(h̄(x) − p̄)

45 BER immunity under corruption. Proposition (c.f. Zhang and Lee, 2008): for any D, D̄, and classifier f : X → {±1}, BER_D̄(f) = (1 − a − b)·BER_D(f) + (a + b)/2. BER-optimal classifiers on the clean and corrupted distributions coincide: sign(h(x) − p) = sign(h̄(x) − p̄). Minimise clean BER → don't need to know corruption rates! The threshold on h̄ does not need a, b, p
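A quick numerical check of the proposition (a standalone sketch, not from the paper): express a classifier's corrupted error rates in terms of its clean ones under mutual contamination, and confirm the affine relation.

```python
import numpy as np

def ber(fpr, fnr):
    return 0.5 * (fpr + fnr)

def corrupted_rates(fpr, fnr, a, b):
    """Under P_bar = (1-a)P + aQ, Q_bar = bP + (1-b)Q:
       FPR_bar = (1-b)*FPR + b*(1-FNR),  FNR_bar = (1-a)*FNR + a*(1-FPR)."""
    return (1 - b) * fpr + b * (1 - fnr), (1 - a) * fnr + a * (1 - fpr)

# Check BER_bar = (1-a-b)*BER + (a+b)/2 on random inputs.
rng = np.random.default_rng(0)
for _ in range(5):
    fpr, fnr, a, b = rng.uniform(0, 0.5, size=4)
    fpr_c, fnr_c = corrupted_rates(fpr, fnr, a, b)
    assert np.isclose(ber(fpr_c, fnr_c), (1 - a - b) * ber(fpr, fnr) + (a + b) / 2)
```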

46 BER immunity & class-probability estimation. Trivially, we also have regret^D_BER(f) = (1 − a − b)^{-1} · regret^D̄_BER(f), i.e. good corrupted BER ⟹ good clean BER; can make regret^D̄_BER(f) → 0 by class-probability estimation. Similar result for AUC (see poster)

47 BER immunity under corruption: proof. From (Scott et al., 2013), [FPR_D̄(f), FNR_D̄(f)] = [FPR_D(f), FNR_D(f)] · [[1 − b, −a], [−b, 1 − a]] + [b, a],

48 BER immunity under corruption: proof. From (Scott et al., 2013), [FPR_D̄(f), FNR_D̄(f)] = [FPR_D(f), FNR_D(f)] · [[1 − b, −a], [−b, 1 − a]] + [b, a], and [1, 1]^T is an eigenvector of [[1 − b, −a], [−b, 1 − a]] (with eigenvalue 1 − a − b)
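The eigenvector claim is easy to verify numerically (the values of a and b below are arbitrary, chosen only for illustration):

```python
import numpy as np

a, b = 0.2, 0.35
A = np.array([[1 - b, -a],
              [-b, 1 - a]])
v = np.ones(2)
# [1, 1]^T is an eigenvector with eigenvalue 1 - a - b, so summing the two corrupted
# error rates (i.e. forming the BER) gives an affine transform of the clean BER.
assert np.allclose(A @ v, (1 - a - b) * v)
```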

49 Are other measures immune? BER is the only (non-trivial) performance measure for which: corrupted risk = affine transform of clean risk (because of the eigenvector interpretation); corrupted threshold is independent of a, b, p (because of the nature of f_{a,b,p}; see poster). Other performance measures → need (one of) a, b, p

50 Experiments

51 Experimental setup. Injected label noise on UCI datasets. Estimate corrupted class-probabilities via a neural network; well-specified if D is linearly separable: h(x) = σ(⟨w, x⟩) ⟹ h̄(x) = a·σ(⟨w, x⟩) + b. Evaluate: reliability of noise estimates; BER performance on a clean test set (corrupted data used for training and validation); 0-1 performance on a clean test set (see poster)
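An end-to-end miniature of this setup (a sketch only: synthetic data from scikit-learn's make_classification stands in for the UCI datasets, and plain logistic regression stands in for the neural network class-probability estimator): inject asymmetric label noise, fit on the corrupted labels, and exploit BER immunity by thresholding the corrupted probabilities at the corrupted base rate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Clean data with labels in {-1, +1}, then injected asymmetric label noise (r_pos, r_neg).
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
y = 2 * y - 1
r_pos, r_neg = 0.1, 0.2
flip_prob = np.where(y == 1, r_pos, r_neg)
y_corr = np.where(rng.random(len(y)) < flip_prob, -y, y)

# Corrupted class-probability estimator trained on corrupted labels only.
clf = LogisticRegression(max_iter=1000).fit(X, y_corr)
probs = clf.predict_proba(X)[:, 1]

# BER-immune prediction: threshold the corrupted probabilities at the corrupted base
# rate -- no knowledge of (r_pos, r_neg) is needed.
p_corr = (y_corr == 1).mean()
y_pred = np.where(probs > p_corr, 1, -1)

# Clean BER (for brevity evaluated on the same sample; the slides use a held-out clean test set).
fpr = np.mean(y_pred[y == -1] == 1)
fnr = np.mean(y_pred[y == 1] == -1)
print("clean BER:", 0.5 * (fpr + fnr))
```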

52 Experimental results: noise rates. Estimated noise rates are generally reliable. [Figure: bias of the noise-rate estimates (mean and median) versus ground-truth noise, on segment, spambase and mnist.]

53 Experimental results: BER immunity. Generally, low observed degradation in BER. [Table: 1 − AUC (%) and BER (%) on the clean test set for segment, spambase and mnist, under noise levels (r+, r−) ∈ {none, (0.1, 0.0), (0.1, 0.2), (0.2, 0.4)}; degradation relative to the noise-free case is small, e.g. on spambase the mean 1 − AUC rises from 2.49% with no noise to 4.91% at (0.2, 0.4).]

54 Conclusion

55 Learning from corrupted binary labels. The monotone relationship h̄(x) = f_{a,b,p}(h(x)) facilitates: [Pipeline: nature → D → corruptor → D̄ → class-prob estimator (kernel logistic regression) → ĥ → classifier sign(ĥ(x) − f_{â,b̂,p̂}(t)), with the noise estimator using the range of ĥ to supply â, b̂, p̂ (omitted for BER).]

56 Future work. Better noise estimators in special cases? c.f. (Elkan and Noto, 2008) when D is separable. Fusion with the loss-transfer approach of (Natarajan et al., 2013), which assumes the noise rates are known: better for misspecified models? c.f. non-robustness of convex surrogate minimisation

57 Thanks! Drop by the poster for more (Paper ID 69)
