Learning with Noisy Labels
Kate Niehaus
Reading group, 11-Feb-2014
Outline
- Motivations
- Generative model approach: Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise. Proceedings of the 18th ICML. 2001.
- Brief extensions/comparisons: Li, Y., et al. Classification in the presence of class noise using a probabilistic Kernel Fisher method. Pattern Recognition. 2007.
- Discriminative model approach: Bootkrajang, J. & Kabán, A. Label-noise robust logistic regression and applications. Machine Learning and Knowledge Discovery. 2012.
Motivation
- Noisy phenotyping labels for tuberculosis: slightly resistant samples may not exhibit growth, and cut-offs for defining resistance are not perfect
- Sloppy labels, e.g. from tasks that require repetitive human labelling
- Extensions to semi-supervised learning
- Many situations!
Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise. 2001.
General framework: generative model (graph: y → x, y → ŷ):
P(x, y, ŷ) = P(ŷ|y) p(x|y) P(y)
- P(ŷ|y): probability table
- p(x|y): N(x | μ_y, Σ_y)
- P(y): P(y=1) = π, P(y=0) = 1 - π
Equivalently, conditioning on the observed label instead (graph: ŷ → y → x):
P(x, y, ŷ) = P(y|ŷ) p(x|y) P(ŷ)
- P(y|ŷ): probability table (below)
- p(x|y): N(x | μ_y, Σ_y)
- P(ŷ): P(ŷ=1) = π̂, P(ŷ=0) = 1 - π̂
P(y|ŷ):           y=0     y=1
         ŷ=0     1-γ_0    γ_0
         ŷ=1      γ_1    1-γ_1
A toy sampler for this model is sketched below.
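To make the setup concrete, here is a toy sampler for the first factorization P(ŷ|y) p(x|y) P(y). This is my own illustration, not the authors' code; the names (`pi`, `mu`, `Sigma`, `gamma`) are mine, with `gamma[y]` the probability that the true label y is flipped in the observed label.

```python
# Toy sampler for P(x, y, yhat) = P(yhat|y) p(x|y) P(y); illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def sample(n, pi, mu, Sigma, gamma):
    y = rng.binomial(1, pi, size=n)                      # true labels, P(y=1) = pi
    x = np.stack([rng.multivariate_normal(mu[t], Sigma[t]) for t in y])
    flip = rng.random(n) < np.asarray(gamma)[y]          # flip with prob gamma_y
    y_hat = np.where(flip, 1 - y, y)                     # observed noisy labels
    return x, y, y_hat

# e.g. X, y, y_hat = sample(200, pi=0.5, mu=[[0, 0], [2, 2]],
#                           Sigma=[np.eye(2)] * 2, gamma=[0.1, 0.2])
```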
Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise. 2001.
Computing the log-likelihood
p(y, x) = p(y|x) p(x) = p(x|y) p(y)
Bayes: p(y|x) = p(x|y) p(y) / p(x), i.e. posterior = likelihood × prior / evidence
Non-noisy case: maximize Σ_n ln p(x_n, y_n | θ), taking the labels at face value
Noisy case: the true label y is latent, so maximize Σ_n ln p(x_n | ŷ_n, θ), marginalizing over y (next slide)
Computing the log-likelihood
Typically, marginalize over the latent variable:
p(x | ŷ, θ) = Σ_y p(x, y | ŷ, θ) = Σ_y p(x | y, θ) P(y | ŷ)
Alternative perspective (Bishop, C.M. Pattern Recognition and Machine Learning. 2006):
ln p(x | ŷ, θ) = L(q, θ) + KL(q‖p)
- L(q, θ) = Σ_y q(y) ln[ p(x, y | ŷ, θ) / q(y) ]: lower bound on the log-likelihood
- KL(q‖p) = -Σ_y q(y) ln[ p(y | x, ŷ, θ) / q(y) ]: Kullback-Leibler divergence, ≥ 0, with equality iff q(y) = p(y | x, ŷ, θ)
Setting q to the posterior makes the bound tight; the bound at this q is the expected complete-data log-likelihood R (up to an entropy term that does not depend on θ).
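Restated cleanly in Bishop-style notation (this is just the slide's decomposition written out, not extra material from the paper):

```latex
\ln p(x \mid \hat{y}, \theta) \;=\; \mathcal{L}(q, \theta) \;+\; \mathrm{KL}(q \,\|\, p),
\qquad
\mathcal{L}(q, \theta) = \sum_{y} q(y) \ln \frac{p(x, y \mid \hat{y}, \theta)}{q(y)},
\qquad
\mathrm{KL}(q \,\|\, p) = -\sum_{y} q(y) \ln \frac{p(y \mid x, \hat{y}, \theta)}{q(y)} \;\ge\; 0,
% equality iff q(y) = p(y | x, yhat, theta)
```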
Expectation-maximization algorithm
0. Initialize the parameters θ
1. E-step: compute the posterior distribution over y (the latent true label)
2. M-step: optimize R (the expected complete-data log-likelihood) with respect to θ
Iterate until convergence.
Here, the two classes are modelled as Gaussian distributions, θ_y = {μ_y, Σ_y}: class 1 has μ_1, Σ_1 and class 2 has μ_2, Σ_2.
[Figure: two Gaussian components; image from http://math.bu.edu/people/sray/mat3.gif]
Expectation-maximization algorithm
0. Initialize the parameters θ
1. E-step: compute the posterior distribution over y (see the sketch below):
P(y | x, ŷ, θ_old) = p(x, y | ŷ, θ_old) / p(x | ŷ, θ_old)
                   = p(x | y, θ_old) P(y | ŷ) / [ p(x, y=0 | ŷ, θ_old) + p(x, y=1 | ŷ, θ_old) ]
2. M-step: optimize R with respect to θ: take the derivative of the lower bound L, set it equal to zero, and rearrange to obtain the update equations
Iterate until convergence.
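A minimal E-step sketch for the Gaussian model, assuming a 2×2 table `gamma[yhat, y] = P(y | ŷ)` as on the earlier slide; the function and variable names are mine, not the paper's:

```python
# E-step: responsibilities r[n, y] = P(y | x_n, yhat_n, theta_old).
# Assumes binary classes with Gaussian class-conditionals; illustrative only.
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, y_hat, mu, cov, gamma):
    joint = np.zeros((X.shape[0], 2))
    for y in (0, 1):
        # p(x, y | yhat) = p(x | y) * P(y | yhat)
        joint[:, y] = multivariate_normal.pdf(X, mean=mu[y], cov=cov[y]) * gamma[y_hat, y]
    return joint / joint.sum(axis=1, keepdims=True)      # normalize over y
```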
Expectation-maximization algorithm
2. M-step: optimize R (the expected complete-data log-likelihood) with respect to θ; for this model the updates are responsibility-weighted re-estimates of the class means, covariances, and probability tables (a generic sketch below).
Finally: use the parameters obtained from the EM algorithm to make classification decisions on new data.
Li, Y., et al. Classification in the presence of class noise using a probabilistic Kernel Fisher method. Pattern Recognition. 2007.
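The slide does not reproduce the paper's exact update equations, so here is the standard responsibility-weighted Gaussian update as a placeholder sketch (my code, not the paper's):

```python
# M-step: re-estimate mu_y, Sigma_y from responsibilities r[n, y]
# (standard weighted-Gaussian updates; the paper's full M-step also
# re-estimates the probability tables).
import numpy as np

def m_step(X, r):
    mu, cov = [], []
    for y in (0, 1):
        w = r[:, y] / r[:, y].sum()         # normalized weights
        m = w @ X                           # weighted mean
        d = X - m
        mu.append(m)
        cov.append((w[:, None] * d).T @ d)  # weighted covariance
    return np.array(mu), cov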
Expectation-maximization algorithm
1. Compute the lower bound on the log-likelihood using the old θ
2. Find the new θ by maximizing the lower bound
Then step 1 again, using the just-calculated θ. Eventually, EM reaches a local maximum of the log-likelihood.
[Figure: EM as alternating bound construction and bound maximization. Bishop, C.M. Pattern Recognition and Machine Learning. 2006.]
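In symbols, one EM iteration is the alternation (again just restating the picture, in Bishop's notation):

```latex
\text{E-step:}\quad q^{(t+1)}(y) = p\big(y \mid x, \hat{y}, \theta^{(t)}\big)
\qquad
\text{M-step:}\quad \theta^{(t+1)} = \arg\max_{\theta} \sum_{y} q^{(t+1)}(y)\, \ln p\big(x, y \mid \hat{y}, \theta\big)
```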
Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise. 2001.
Performance
[Figure: results for the standard formulation vs. the formulation adjusted for noisy labels.]
Outline
- Motivations
- Generative model approach: Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise. 2001.
- Brief extensions/comparisons: Li, Y., et al. Classification in the presence of class noise using a probabilistic Kernel Fisher method. Pattern Recognition. 2007.
- Discriminative model approach: Bootkrajang, J. & Kabán, A. Label-noise robust logistic regression and applications. 2012.
Extension to kernels, via Fisher's discriminant
- Idea: the best separation is obtained by maximizing (variance between classes) / (variance within classes), i.e. J(w) = (w^T S_B w) / (w^T S_W w)
- The optimal w is proportional to S_W^{-1}(m_0 - m_1)
- Discriminating hyperplane: w^T x + b = 0
Extending to kernels: write w = Σ_n α_n φ(x_n), so the projection is w^T φ(x) = Σ_n α_n k(x_n, x); at the maximum, α ∝ N^{-1}(M_0 - M_1), the kernel-space analogue of the solution above (see the sketch below).
Li, Y., et al. Classification in the presence of class noise using a probabilistic Kernel Fisher method. Pattern Recognition. 2007.
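A bare-bones kernel Fisher discriminant sketch, to show what is being kernelized. This is the standard noise-free KFD, not the paper's noise-robust variant; the RBF kernel and the regularizer `reg` are my choices:

```python
# Standard kernel Fisher discriminant: alpha = N^{-1} (M_0 - M_1),
# with projection f(x) = sum_n alpha_n k(x_n, x). Illustrative only.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kfd_alpha(X, y, gamma=1.0, reg=1e-3):
    K = rbf_kernel(X, X, gamma)
    n = len(y)
    M, N = [], np.zeros((n, n))
    for c in (0, 1):
        Kc = K[:, y == c]                             # kernel columns of class c
        nc = Kc.shape[1]
        M.append(Kc.mean(axis=1))                     # kernel class mean M_c
        N += Kc @ (np.eye(nc) - np.full((nc, nc), 1.0 / nc)) @ Kc.T
    return np.linalg.solve(N + reg * np.eye(n), M[0] - M[1])

# Project new points: f = rbf_kernel(X_new, X_train, gamma) @ alpha
```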
Extension to mixtures of Gaussians
Steps (a rough code sketch follows):
- Estimate the number of mixture components by optimizing the total log-likelihood
- Associate each mixture component with the noisy labels
- Optimize the mixture density parameters with EM
- Associate the updated mixture components with class labels
- Use the result to build a Bayes classifier
[Figure: two classes, each modelled by several mixture components.]
Li, Y., et al. Classification in the presence of class noise using a probabilistic Kernel Fisher method. Pattern Recognition. 2007.
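A rough rendering of these steps in code. This is my own pipeline, not the paper's: the component-to-class association is a simple responsibility-weighted vote over the noisy labels, and the component count would in practice be chosen by the likelihood criterion in step 1:

```python
# Mixture-of-Gaussians classifier built from noisy labels; illustrative only.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_mixture_classifier(X, y_noisy, n_components=4, seed=0):
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X)
    resp = gmm.predict_proba(X)                       # r[n, k]
    # associate each component with the class whose noisy labels it explains best
    comp_class = np.array([
        np.argmax([resp[y_noisy == c, k].sum() for c in (0, 1)])
        for k in range(n_components)
    ])
    return gmm, comp_class

def predict(gmm, comp_class, X_new):
    p1 = gmm.predict_proba(X_new)[:, comp_class == 1].sum(axis=1)
    return (p1 > 0.5).astype(int)                     # Bayes decision on P(class 1 | x)
```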
Discriminative model comparison: logistic regression
Without noisy labels (labels taken at face value):
L(w) = Σ_n { ŷ_n ln p(y_n=1 | x_n, w) + (1-ŷ_n) ln p(y_n=0 | x_n, w) }
With noisy labels, writing γ_0 = P(ŷ=1 | y=0) and γ_1 = P(ŷ=0 | y=1):
L(w) = Σ_n { ŷ_n ln[ γ_0 p(y_n=0 | x_n, w) + (1-γ_1) p(y_n=1 | x_n, w) ]
           + (1-ŷ_n) ln[ (1-γ_0) p(y_n=0 | x_n, w) + γ_1 p(y_n=1 | x_n, w) ] }
with p(y=1 | x, w) = 1 / (1 + e^{-w^T x}). Optimize with multiplicative updates (a loss-function sketch follows).
Flip table P(ŷ|y):        y=0     y=1
              ŷ=0       1-γ_0    γ_1
              ŷ=1        γ_0    1-γ_1
Bootkrajang, J. & Kabán, A. Label-noise robust logistic regression and applications. 2012.
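A direct sketch of this noisy-label likelihood as code. For simplicity it is optimized with L-BFGS rather than the paper's multiplicative updates, and the noise rates are assumed known; all names are mine:

```python
# Label-noise-robust logistic regression loss (negative log-likelihood of
# the noisy labels); gamma0 = P(yhat=1 | y=0), gamma1 = P(yhat=0 | y=1).
import numpy as np
from scipy.optimize import minimize

def robust_nll(w, X, y_hat, gamma0, gamma1):
    s = 1.0 / (1.0 + np.exp(-X @ w))                  # p(y=1 | x, w)
    p_hat1 = gamma0 * (1 - s) + (1 - gamma1) * s      # P(yhat=1 | x, w)
    eps = 1e-12                                       # guard against log(0)
    return -np.sum(y_hat * np.log(p_hat1 + eps)
                   + (1 - y_hat) * np.log(1 - p_hat1 + eps))

def fit(X, y_hat, gamma0=0.1, gamma1=0.1):
    w0 = np.zeros(X.shape[1])
    res = minimize(robust_nll, w0, args=(X, y_hat, gamma0, gamma1),
                   method="L-BFGS-B")
    return res.x
```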
Discriminative model comparison
[Figure: results with multiplicative updates + conjugate gradient method vs. EM + Newton's method.]
Bootkrajang, J. & Kabán, A. Label-noise robust logistic regression and applications. 2012.
Discriminative model comparison
Also: extension to multi-class situations; a convergence proof; a Bayesian regularization term.
Bootkrajang, J. & Kabán, A. Label-noise robust logistic regression and applications. 2012.
If you want more
- Bootkrajang, J. & Kabán, A. Label-noise robust logistic regression and applications. Machine Learning and Knowledge Discovery. 2012.
- Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise. Proceedings of the 18th ICML. 2001.
- Li, Y., et al. Classification in the presence of class noise using a probabilistic Kernel Fisher method. Pattern Recognition. 2007.
- Natarajan, N., et al. Learning with noisy labels. NIPS. 2013.
- Raykar, V., et al. Learning from crowds. Journal of Machine Learning Research. 2010.