Learning with Noisy Labels. Kate Niehaus Reading group 11-Feb-2014



2 Outline
- Motivations
- Generative model approach: Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise. Proceedings of the 18th ICML
- Brief extensions/comparisons:
  - Extension: Li, Y., et al. Classification in the presence of class noise using a probabilistic Kernel Fisher method. Pattern Recognition
  - Discriminative model approach: Bootkrajang, J. & Kaban, A. Label-noise robust logistic regression and applications. Machine Learning and Knowledge Discovery

3 Motivation
- Noisy phenotyping labels for tuberculosis
  - Slightly resistant samples may not exhibit growth
  - Cut-offs for defining resistance are not perfect
- Sloppy labels, e.g. from tasks that require repetitive human labeling
- Extensions to semi-supervised learning
- Many situations!

4 Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise
General framework: generative model
Graphical model over y, x, ŷ:
$P(x, y, \hat{y}) = P(\hat{y} \mid y)\, p(x \mid y)\, P(y)$

5 Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise
General framework: generative model
Graphical model over y, x, ŷ:
$P(x, y, \hat{y}) = P(\hat{y} \mid y)\, p(x \mid y)\, P(y)$
- $P(\hat{y} \mid y)$: probability table
- $p(x \mid y)$: $N(x \mid m_y, \Sigma_y)$
- $P(y)$: $p(y=1) = \gamma$, $p(y=0) = 1 - \gamma$

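6 General framework: generative model
Two graphical models over y, x, ŷ.
$P(x, y, \hat{y}) = P(\hat{y} \mid y)\, p(x \mid y)\, P(y)$
- $P(\hat{y} \mid y)$: probability table
- $p(x \mid y)$: $N(x \mid m_y, \Sigma_y)$
- $P(y)$: $p(y=1) = \gamma$, $p(y=0) = 1 - \gamma$
$P(x, y, \hat{y}) = P(y \mid \hat{y})\, p(x \mid y)\, P(\hat{y})$
- $P(y \mid \hat{y})$: probability table
- $p(x \mid y)$: $N(x \mid m_y, \Sigma_y)$
- $P(\hat{y})$: $p(\hat{y}=1) = \hat{\gamma}$, $p(\hat{y}=0) = 1 - \hat{\gamma}$
Probability table (rows: ŷ; columns: y = 0, y = 1):
- ŷ = 0: $1 - \hat{\gamma}_0$, $\hat{\gamma}_0$
- ŷ = 1: $\hat{\gamma}_1$, $1 - \hat{\gamma}_1$
(Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise)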
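As a rough illustration of the factorization $P(\hat{y} \mid y)\, p(x \mid y)\, P(y)$, the sketch below draws samples by first picking a true label, then a feature, then a possibly flipped observed label. It assumes 1-D Gaussian class-conditionals; the names `gamma`, `means`, `stds` and `flip` (with `flip[y]` the probability that the observed label differs from the true one) are hypothetical and not taken from the paper.

```python
import random

def sample_noisy_point(rng, gamma, means, stds, flip):
    """Draw (x, y, y_hat) from P(y) p(x | y) P(y_hat | y)."""
    y = 1 if rng.random() < gamma else 0       # true label, P(y=1) = gamma
    x = rng.gauss(means[y], stds[y])           # 1-D Gaussian class-conditional
    flipped = rng.random() < flip[y]           # flip[y] = P(y_hat != y | y)
    y_hat = 1 - y if flipped else y
    return x, y, y_hat

rng = random.Random(0)
data = [sample_noisy_point(rng, 0.5, (-2.0, 2.0), (1.0, 1.0), (0.1, 0.3))
        for _ in range(20000)]

# Empirical fraction of true class-1 points whose observed label is 0.
n_y1 = sum(1 for _, y, _ in data if y == 1)
flip_rate_y1 = sum(1 for _, y, yh in data if y == 1 and yh == 0) / n_y1
```

With enough samples, `flip_rate_y1` should sit close to `flip[1]`, which is the kind of quantity the probability table encodes.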

7 Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise
Computing the log-likelihood
$p(y, x) = p(y \mid x)\, p(x) = p(x \mid y)\, p(y)$
Bayes: $p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$, i.e. posterior = likelihood × prior / evidence
Non-noisy case:
Noisy case:

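8 Computing the log-likelihood
Typically, marginalize over the latent variable:
$p(x \mid \hat{y}, \theta) = \sum_y p(x, y \mid \hat{y}, \theta) = \sum_y p(y \mid x, \hat{y}, \theta)\, p(x \mid \hat{y}, \theta)$
Alternative perspective:
$\ln p(x \mid \hat{y}, \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p)$
- $\mathcal{L}(q, \theta) = \sum_y q(y) \ln\frac{p(x, y \mid \hat{y}, \theta)}{q(y)}$: lower bound on the likelihood function
- $\mathrm{KL}(q \,\|\, p) = -\sum_y q(y) \ln\frac{p(y \mid x, \hat{y}, \theta)}{q(y)}$: Kullback-Leibler divergence, $= 0$ iff $q(y) = p(y \mid x, \hat{y}, \theta)$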

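9 Computing the log-likelihood
Alternative perspective:
$\ln p(x \mid \hat{y}, \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p)$
- $\mathcal{L}(q, \theta) = \sum_y q(y) \ln\frac{p(x, y \mid \hat{y}, \theta)}{q(y)} = R$: lower bound on the likelihood function
- $\mathrm{KL}(q \,\|\, p) = -\sum_y q(y) \ln\frac{p(y \mid x, \hat{y}, \theta)}{q(y)}$: Kullback-Leibler divergence, $= 0$ iff $q(y) = p(y \mid x, \hat{y}, \theta)$
(Bishop, C.M. Pattern Recognition and Machine Learning)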
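The decomposition of the log marginal likelihood into a lower bound plus a KL term can be checked numerically for a single observation with a binary latent label. This is only a sketch under simplifying assumptions: 1-D Gaussian class-conditionals, and a hypothetical argument `prior1` standing in for the probability of y = 1 given the observed label.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def bound_and_kl(x, q1, prior1, means, stds):
    """Return (log marginal, lower bound, KL(q || posterior)) for one point.

    q1 is q(y=1); prior1 plays the role of p(y=1 | y_hat, theta).
    """
    prior = (1.0 - prior1, prior1)
    q = (1.0 - q1, q1)
    joint = [prior[y] * normal_pdf(x, means[y], stds[y]) for y in (0, 1)]
    marginal = sum(joint)                          # sum over the latent y
    posterior = [j / marginal for j in joint]
    lower = sum(q[y] * math.log(joint[y] / q[y]) for y in (0, 1))
    kl = -sum(q[y] * math.log(posterior[y] / q[y]) for y in (0, 1))
    return math.log(marginal), lower, kl

log_px, lower, kl = bound_and_kl(0.7, 0.4, 0.5, (-2.0, 2.0), (1.0, 1.0))
```

For any valid q the two terms add up to the log marginal, and the KL term is non-negative, so `lower` never exceeds `log_px`.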


10 Expectation-maximization algorithm
0. Initialize parameters
1. E-step: Compute the posterior distribution over y (latent variable)
2. M-step: Optimize R (complete log likelihood) with respect to $\theta$
Iterate until convergence
Here, modelling as two Gaussian distributions, $\theta_y = \{m_y, \Sigma_y\}$:
- Class 1: $m_1, \Sigma_1$
- Class 2: $m_2, \Sigma_2$

11 Expectation-maximization algorithm
0. Initialize parameters
1. E-step: Compute the posterior distribution over y (latent variable):
$P(y \mid x, \hat{y}, \theta^{old}) = \frac{p(x, y \mid \hat{y}, \theta^{old})}{p(x \mid \hat{y}, \theta^{old})} = \frac{p(x \mid y, \hat{y}, \theta^{old})\, p(y \mid \hat{y}, \theta^{old})}{p(x, y=1 \mid \hat{y}, \theta^{old}) + p(x, y=0 \mid \hat{y}, \theta^{old})}$
2. M-step: Optimize R (complete log likelihood) with respect to $\theta$: take the derivative of $\mathcal{L}$ (lower bound) and set it equal to zero; rearrange to get the update equations
Iterate until convergence
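A minimal sketch of the E-step for one observation, assuming 1-D Gaussian class-conditionals. The argument `p_y1_given_yhat` is a hypothetical name for the table entries P(y = 1 | ŷ = 0) and P(y = 1 | ŷ = 1) under the current parameters.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def e_step(x, y_hat, means, stds, p_y1_given_yhat):
    """Posterior P(y=1 | x, y_hat, theta_old) for one observation."""
    p1 = p_y1_given_yhat[y_hat]                        # P(y=1 | y_hat)
    joint1 = normal_pdf(x, means[1], stds[1]) * p1     # p(x, y=1 | y_hat)
    joint0 = normal_pdf(x, means[0], stds[0]) * (1.0 - p1)
    return joint1 / (joint0 + joint1)                  # normalise over y

# A point deep inside class 1 that was observed with label 0.
r = e_step(1.5, 0, means=(-2.0, 2.0), stds=(1.0, 1.0), p_y1_given_yhat=(0.2, 0.9))
```

Here the feature evidence outweighs the observed label, so the posterior for y = 1 stays high even though ŷ = 0.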

12 Expectation-maximization algorithm
2. M-step: Optimize R (complete log likelihood) with respect to $\theta$
Finally: Use the parameters derived from the EM algorithm to make classification decisions on new data
(Li, Y., et al. Classification in the presence of class noise using a probabilistic Kernel Fisher method. Pattern Recognition)

13 Expectation-maximization algorithm
1. Compute the lower bound on the log likelihood using $\theta^{old}$
2. Find $\theta^{new}$ by maximizing the lower bound
1., again. Compute the lower bound on the log likelihood using the just-calculated $\theta^{new}$
Eventually, reach a local maximum of the log likelihood
(Bishop, C.M. Pattern Recognition and Machine Learning)
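The full loop can be sketched as follows for two 1-D Gaussian classes with noisy observed labels. This is an illustrative toy, not the exact updates from the paper: the initialisation from the noisy labels, the starting table `p_y1` (a guess for P(y = 1 | ŷ)) and the synthetic data are all assumptions made for the example.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_noisy_labels(xs, y_hats, n_iter=100):
    """EM for two 1-D Gaussian classes whose observed labels y_hat are noisy."""
    # 0. Initialise class parameters from the noisy labels (assumed heuristic).
    means, variances = [0.0, 0.0], [1.0, 1.0]
    for k in (0, 1):
        pts = [x for x, yh in zip(xs, y_hats) if yh == k]
        means[k] = sum(pts) / len(pts)
        variances[k] = sum((x - means[k]) ** 2 for x in pts) / len(pts)
    p_y1 = [0.2, 0.8]  # assumed starting guess for P(y=1 | y_hat=k)
    log_liks = []
    for _ in range(n_iter):
        # 1. E-step: resp[n] = P(y_n=1 | x_n, y_hat_n, theta_old).
        resp, log_lik = [], 0.0
        for x, yh in zip(xs, y_hats):
            joint1 = normal_pdf(x, means[1], variances[1]) * p_y1[yh]
            joint0 = normal_pdf(x, means[0], variances[0]) * (1.0 - p_y1[yh])
            resp.append(joint1 / (joint0 + joint1))
            log_lik += math.log(joint0 + joint1)
        log_liks.append(log_lik)  # log likelihood at the parameters used above
        # 2. M-step: responsibility-weighted means, variances and label table.
        for k in (0, 1):
            weights = [r if k == 1 else 1.0 - r for r in resp]
            total = sum(weights)
            means[k] = sum(w * x for w, x in zip(weights, xs)) / total
            variances[k] = sum(w * (x - means[k]) ** 2
                               for w, x in zip(weights, xs)) / total
            same = [r for r, yh in zip(resp, y_hats) if yh == k]
            p_y1[k] = sum(same) / len(same)
    return means, variances, p_y1, log_liks

# Synthetic data: true means -2 and +2, unit variance, 20% of labels flipped.
rng = random.Random(1)
xs, y_hats = [], []
for _ in range(2000):
    y = rng.randrange(2)
    xs.append(rng.gauss(2.0 if y == 1 else -2.0, 1.0))
    y_hats.append(1 - y if rng.random() < 0.2 else y)

means, variances, p_y1, log_liks = em_noisy_labels(xs, y_hats)
```

The recorded `log_liks` should never decrease from one iteration to the next, which is the behaviour the slide describes: each E-step/M-step pair raises the lower bound until a local maximum is reached.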

14 Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise
Performance
[Figures: two comparisons, each with panels "Standard formulation" and "Adjusted for noisy labels"]

15 Outline
- Motivations
- Generative model approach: Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise
- Brief extensions/comparisons:
  - Extension: Li, Y., et al. Classification in the presence of class noise using a probabilistic Kernel Fisher method. Pattern Recognition
  - Discriminative model approach: Bootkrajang, J. & Kaban, A. Label-noise robust logistic regression and applications

16 Extension to kernels
Via Fisher's discriminant
- Idea: best separation occurs when we maximize the ratio (variance between classes) / (variance within classes)
- Optimal w is proportional to $\Sigma_w^{-1}(m_0 - m_1)$, with $\Sigma_w$ the within-class covariance
- With discriminating hyperplane: $w^T x$
Extending to kernels:
- Projection:
- At max:
(Li, Y., et al. Classification in the presence of class noise using a probabilistic Kernel Fisher method. Pattern Recognition)
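For the linear (non-kernel) case, the Fisher direction can be sketched in a few lines. The example below assumes 2-D points so the within-class scatter matrix can be inverted by hand; `fisher_direction` and the toy data are hypothetical and only illustrate that w is proportional to the inverse within-class scatter times the difference of class means.

```python
def fisher_direction(class0, class1):
    """Return w proportional to S_w^{-1} (m0 - m1) for lists of 2-D points."""
    def mean(points):
        n = len(points)
        return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

    m0, m1 = mean(class0), mean(class1)
    # Within-class scatter: sum of outer products of deviations from class means.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for points, m in ((class0, m0), (class1, m1)):
        for p in points:
            d = (p[0] - m[0], p[1] - m[1])
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    # Closed-form inverse of the 2x2 scatter matrix.
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = (m0[0] - m1[0], m0[1] - m1[1])
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

class0 = [(0.0, 0.0), (1.0, 0.5), (0.0, 1.0), (1.0, 1.5)]
class1 = [(3.0, 0.0), (4.0, 0.5), (3.0, 1.0), (4.0, 1.5)]
w = fisher_direction(class0, class1)

# Project every point onto w; the two classes should not overlap.
proj0 = [w[0] * p[0] + w[1] * p[1] for p in class0]
proj1 = [w[0] * p[0] + w[1] * p[1] for p in class1]
```

Only the direction of w matters, so any positive rescaling gives the same separation of the projected classes.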


17 Extension to mixtures of Gaussians
Steps:
- Estimate the number of mixture components by optimizing total log-likelihood
- Associate each mixture component to the noisy labels
- Optimize mixture density parameters with EM
- Associate updated mixture components to class labels
- Use to create Bayes classifier
[Figure: Class 1, Class 2]
(Li, Y., et al. Classification in the presence of class noise using a probabilistic Kernel Fisher method. Pattern Recognition)

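18 Discriminative model comparison
Logistic regression
Without noisy labels:
$L(w) = \sum_n \left\{ \hat{y}_n \ln[p(\hat{y}_n = 1 \mid x_n, w)] + (1 - \hat{y}_n) \ln[p(\hat{y}_n = 0 \mid x_n, w)] \right\}$
With noisy labels:
$L(w) = \sum_n \left\{ \hat{y}_n \ln[\hat{\gamma}_1\, p(y_n = 0 \mid x_n, w) + (1 - \hat{\gamma}_1)\, p(y_n = 1 \mid x_n, w)] + (1 - \hat{y}_n) \ln[(1 - \hat{\gamma}_0)\, p(y_n = 0 \mid x_n, w) + \hat{\gamma}_0\, p(y_n = 1 \mid x_n, w)] \right\}$
With $p(y = 1 \mid x, w) = \frac{1}{1 + e^{-w^T x}}$
Optimize with multiplicative updates
Probability table (rows: ŷ; columns: y = 0, y = 1):
- ŷ = 0: $1 - \hat{\gamma}_0$, $\hat{\gamma}_0$
- ŷ = 1: $\hat{\gamma}_1$, $1 - \hat{\gamma}_1$
(Bootkrajang, J. & Kaban, A. Label-noise robust logistic regression and applications)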
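A small sketch of how a label-noise log-likelihood differs from the standard one. The parameterisation is an assumption made for this example: `flip0` = P(ŷ = 1 | y = 0) and `flip1` = P(ŷ = 0 | y = 1) are hypothetical names, indexed by the true label so that the two observed-label probabilities sum to one. With `flip0 == flip1` the mixing has the same form as the noisy-label expression on the slide, and with both set to zero it reduces to ordinary logistic regression.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(w, xs, y_hats, flip0=0.0, flip1=0.0):
    """Log-likelihood of observed labels under label-noise logistic regression."""
    total = 0.0
    for x, y_hat in zip(xs, y_hats):
        p1 = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))  # p(y=1 | x, w)
        p0 = 1.0 - p1
        obs1 = flip0 * p0 + (1.0 - flip1) * p1   # P(y_hat=1 | x, w)
        obs0 = (1.0 - flip0) * p0 + flip1 * p1   # P(y_hat=0 | x, w)
        total += math.log(obs1) if y_hat == 1 else math.log(obs0)
    return total

# First feature is a constant bias term; the first point looks mislabelled.
xs = [(1.0, 3.0), (1.0, -2.0), (1.0, 2.5)]
y_hats = [0, 0, 1]
w = (0.0, 1.0)

clean = log_likelihood(w, xs, y_hats)
robust = log_likelihood(w, xs, y_hats, flip0=0.2, flip1=0.2)
```

Allowing for flips penalises the apparently mislabelled point far less, so `robust` comes out higher than `clean` for this w; this is the mechanism that keeps a few bad labels from dominating the fit.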

19 Discriminative model comparison
[Figures: "With multiplicative updates and conjugate gradient method"; "With EM and Newton's method"]
(Bootkrajang, J. & Kaban, A. Label-noise robust logistic regression and applications)

20 Discriminative model comparison
Also:
- extend to multi-class situations
- prove convergence
- introduce Bayesian regularization term
(Bootkrajang, J. & Kaban, A. Label-noise robust logistic regression and applications)

21 If you want more
- Bootkrajang, J. & Kaban, A. Label-noise robust logistic regression and applications
- Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise
- Li, Y., et al. Classification in the presence of class noise using a probabilistic Kernel Fisher method. Pattern Recognition
- Natarajan, N., et al. Learning with noisy labels. NIPS
- Raykar, V., et al. Learning from crowds. Journal of Machine Learning Research
