Learning with Noisy Labels. Kate Niehaus Reading group 11-Feb-2014


Outline
- Motivations
- Generative model approach: Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise. Proceedings of the 18th ICML. 2001.
- Brief extensions/comparisons: Li, Y., et al. Classification in the presence of class noise using a probabilistic Kernel Fisher method. Pattern Recognition. 2007.
- Discriminative model approach: Bootkrajang, J. & Kaban, A. Label-noise robust logistic regression and applications. Machine Learning and Knowledge Discovery. 2012.

Motivation
- Noisy phenotyping labels for tuberculosis: slightly resistant samples may not exhibit growth, and cut-offs for defining resistance are not perfect
- Sloppy labels, e.g. from tasks that require repetitive human labelling
- Extensions to semi-supervised learning
- Many situations!

Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise. 2001.
General framework: generative model
[Graphical model: the true label y generates both the observation x and the noisy label ŷ]
P(x, y, ŷ) = P(ŷ|y) p(x|y) P(y)

Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise. 2001.
General framework: generative model
P(x, y, ŷ) = P(ŷ|y) p(x|y) P(y), where:
- P(ŷ|y): probability table
- p(x|y): N(x | m_y, Σ_y)
- P(y): p(y=1) = γ, p(y=0) = 1 - γ

General framework: generative model, two ways of factorising the joint:
1. y generates x and ŷ:  P(x, y, ŷ) = P(ŷ|y) p(x|y) P(y)
   - P(ŷ|y): probability table
   - p(x|y): N(x | m_y, Σ_y)
   - P(y): p(y=1) = γ, p(y=0) = 1 - γ
2. ŷ generates y, which generates x:  P(x, y, ŷ) = P(y|ŷ) p(x|y) P(ŷ)
   - P(y|ŷ): probability table (rows: ŷ, columns: y):
             y=0         y=1
       ŷ=0   1 - γ_h0    γ_h0
       ŷ=1   γ_h1        1 - γ_h1
   - p(x|y): N(x | m_y, Σ_y)
   - P(ŷ): p(ŷ=1) = γ_h, p(ŷ=0) = 1 - γ_h
Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise. 2001.
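To make the second factorisation concrete, here is a minimal sketch of the joint density P(x, y, ŷ) = P(y|ŷ) p(x|y) P(ŷ) for two 2-D Gaussian classes; all numerical values and the function name `joint_density` are invented purely for illustration, not taken from the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters (invented for this sketch)
gamma_h = 0.5                     # P(ŷ = 1)
noise_table = np.array([          # rows: ŷ, columns: y  ->  P(y | ŷ)
    [0.8, 0.2],                   # ŷ = 0: P(y=0|ŷ=0) = 1-γ_h0, P(y=1|ŷ=0) = γ_h0
    [0.3, 0.7],                   # ŷ = 1: P(y=0|ŷ=1) = γ_h1,   P(y=1|ŷ=1) = 1-γ_h1
])
means = [np.zeros(2), 3.0 * np.ones(2)]   # m_0, m_1
covs = [np.eye(2), np.eye(2)]             # Σ_0, Σ_1

def joint_density(x, y, y_hat):
    """P(x, y, ŷ) = P(y|ŷ) p(x|y) P(ŷ) for the two-class noisy-label model."""
    p_yhat = gamma_h if y_hat == 1 else 1.0 - gamma_h
    p_y_given_yhat = noise_table[y_hat, y]
    p_x_given_y = multivariate_normal.pdf(x, mean=means[y], cov=covs[y])
    return p_yhat * p_y_given_yhat * p_x_given_y

print(joint_density(np.array([0.5, -0.2]), y=0, y_hat=0))
```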

Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise. 2001.
Computing the log-likelihood
p(y, x) = p(y|x) p(x) = p(x|y) p(y)
Bayes: p(y|x) = p(x|y) p(y) / p(x), i.e. posterior = (likelihood × prior) / evidence
[Slide shows the resulting log-likelihood expressions for the non-noisy case and the noisy case]

Computing the log-likelihood
Typically, marginalise over the latent variable:
P(x|ŷ, θ) = Σ_y p(x, y|ŷ, θ), with p(x, y|ŷ, θ) = p(y|x, ŷ, θ) p(x|ŷ, θ)
Alternative perspective:
ln p(x|ŷ, θ) = L(q, θ) + KL(q‖p)
  L(q, θ)  = Σ_y q(y) ln[ p(x, y|ŷ, θ) / q(y) ]      (lower bound on the likelihood function)
  KL(q‖p) = -Σ_y q(y) ln[ p(y|x, ŷ, θ) / q(y) ]      (Kullback-Leibler divergence; = 0 iff q(y) = p(y|x, ŷ, θ))

Computing the log-likelihood (alternative perspective, cont.)
ln p(x|ŷ, θ) = L(q, θ) + KL(q‖p)
  L(q, θ)  = Σ_y q(y) ln[ p(x, y|ŷ, θ) / q(y) ] = R   (lower bound on the likelihood function)
  KL(q‖p) = -Σ_y q(y) ln[ p(y|x, ŷ, θ) / q(y) ]       (Kullback-Leibler divergence; = 0 iff q(y) = p(y|x, ŷ, θ))
Bishop, C.M. Pattern Recognition and Machine Learning. 2006.
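A quick numeric check of this decomposition for a binary latent label; the joint values below are made up, the point is only that the lower bound plus the KL term recovers the marginal log-likelihood:

```python
import numpy as np

# Check ln p(x|ŷ,θ) = L(q,θ) + KL(q||p) for one observation x, binary latent y.
p_xy = np.array([0.03, 0.12])       # p(x, y=0 | ŷ, θ) and p(x, y=1 | ŷ, θ) (invented)
p_x = p_xy.sum()                    # marginal p(x | ŷ, θ)
posterior = p_xy / p_x              # p(y | x, ŷ, θ)

q = np.array([0.4, 0.6])            # an arbitrary distribution over y

lower_bound = np.sum(q * np.log(p_xy / q))     # L(q, θ)
kl = -np.sum(q * np.log(posterior / q))        # KL(q || p) >= 0

print(np.log(p_x), lower_bound + kl)   # the two numbers agree
print(kl)                              # drops to 0 when q equals the posterior
```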

Expectation-maximization algorithm
0. Initialize parameters θ
1. E-step: Compute the posterior distribution over y (the latent variable)
2. M-step: Optimize R (the complete log likelihood) with respect to θ
Iterate until convergence.
Here, modelling the classes as two Gaussian distributions: θ_y = {m_y, Σ_y}, i.e. Class 1: m_1, Σ_1; Class 2: m_2, Σ_2.
[Image of two Gaussian clusters: http://math.bu.edu/people/sray/mat3.gif]

Expectation-maximization algorithm
0. Initialize parameters θ
1. E-step: Compute the posterior distribution over y (the latent variable):
   P(y|x, ŷ, θ_old) = p(x, y|ŷ, θ_old) / p(x|ŷ, θ_old)
                    = p(x|y, ŷ, θ_old) p(y|ŷ, θ_old) / [ p(x, y=1|ŷ, θ_old) + p(x, y=0|ŷ, θ_old) ]
2. M-step: Optimize R (the complete log likelihood) with respect to θ:
   take the derivative of L (the lower bound), set it equal to zero, and rearrange to get the update equations.
Iterate until convergence.
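A minimal sketch of this E-step/M-step loop for the two-Gaussian model, holding the flip table fixed for simplicity; the function name, the fixed noise table, and the initialisation from the noisy labels are my assumptions for illustration, not the authors' implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_noisy_labels(X, y_hat, noise_table, n_iter=50):
    """EM for the two-Gaussian noisy-label model.
    noise_table[j, k] plays the role of P(y=k | ŷ=j) (rows: ŷ, columns: y)
    and is held fixed here; only the Gaussian parameters are re-estimated."""
    n, d = X.shape
    # Initialise class means/covariances from the (noisy) labels.
    means = [X[y_hat == k].mean(axis=0) for k in (0, 1)]
    covs = [np.cov(X[y_hat == k].T) + 1e-6 * np.eye(d) for k in (0, 1)]
    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] = P(y=k | x_n, ŷ_n, θ_old)
        lik = np.column_stack(
            [multivariate_normal.pdf(X, mean=means[k], cov=covs[k]) for k in (0, 1)]
        )
        r = noise_table[y_hat, :] * lik          # ∝ P(y | ŷ_n) p(x_n | y)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted updates of the Gaussian parameters
        for k in (0, 1):
            w = r[:, k]
            means[k] = (w[:, None] * X).sum(axis=0) / w.sum()
            diff = X - means[k]
            covs[k] = (w[:, None] * diff).T @ diff / w.sum() + 1e-6 * np.eye(d)
    return means, covs, r
```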

Expectation-maximization algorithm (cont.)
2. M-step: Optimize R (the complete log likelihood) with respect to θ.
Finally: use the parameters derived from the EM algorithm to make classification decisions on new data.
Li, Y., et al. Classification in the presence of class noise using a probabilistic Kernel Fisher method. Pattern Recognition. 2007.
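Once EM has converged, a new point can be classified by comparing the class posteriors under the fitted Gaussians; a minimal sketch, where `means` and `covs` are the outputs of the EM sketch above and the equal class prior is my assumption:

```python
import numpy as np
from scipy.stats import multivariate_normal

def classify(X_new, means, covs, prior1=0.5):
    """Bayes decision using the class Gaussians fitted by EM.
    prior1 = P(y=1) is set to 0.5 here purely for illustration."""
    p0 = (1.0 - prior1) * multivariate_normal.pdf(X_new, mean=means[0], cov=covs[0])
    p1 = prior1 * multivariate_normal.pdf(X_new, mean=means[1], cov=covs[1])
    return (p1 > p0).astype(int)
```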

Expectation-maximization algorithm (picture of the EM bound from Bishop)
1. Compute the lower bound on the log likelihood using θ_old.
2. Find a new θ by maximizing the lower bound.
1. (again) Compute the lower bound on the log likelihood using the just-calculated θ.
Eventually, reach a local maximum of the log likelihood.
Bishop, C.M. Pattern Recognition and Machine Learning. 2006.

Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise. 2001.
Performance
[Plots comparing the standard formulation with the formulation adjusted for noisy labels]

Outline
- Motivations
- Generative model approach: Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise. 2001.
- Brief extensions/comparisons: Li, Y., et al. Classification in the presence of class noise using a probabilistic Kernel Fisher method. Pattern Recognition. 2007.
- Discriminative model approach: Bootkrajang, J. & Kaban, A. Label-noise robust logistic regression and applications. 2012.

Extension to kernels
Via Fisher's discriminant:
- Idea: the best separation occurs when we maximize the ratio (variance between classes) / (variance within classes)
- The optimal w is proportional to Σ_w^{-1} (m_0 - m_1)
- With discriminating hyperplane: w^T x
Extending to kernels: [slide shows the kernelised projection and the expression at the maximum]
Li, Y., et al. Classification in the presence of class noise using a probabilistic Kernel Fisher method. Pattern Recognition. 2007.
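For the linear (non-kernel) case, the Fisher direction can be computed directly; a minimal sketch with the within-class scatter matrix written out explicitly (function and variable names are mine):

```python
import numpy as np

def fisher_direction(X, y):
    """Linear Fisher discriminant: w ∝ S_W^{-1} (m_0 - m_1),
    where S_W is the within-class scatter matrix."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S_w = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(S_w, m0 - m1)
    return w / np.linalg.norm(w)

# Projection onto the discriminant: score = w^T x, thresholded to classify.
```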

Extension to mixtures of Gaussians
Steps (a code sketch follows below):
- Estimate the number of mixture components by optimizing the total log-likelihood
- Associate each mixture component to the noisy labels
- Optimize the mixture density parameters with EM
- Associate the updated mixture components to class labels
- Use these to create a Bayes classifier
[Figure: mixture components grouped into Class 1 and Class 2]
Li, Y., et al. Classification in the presence of class noise using a probabilistic Kernel Fisher method. Pattern Recognition. 2007.
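A minimal sketch of these steps using scikit-learn's GaussianMixture; note that I select the number of components by BIC rather than the raw total log-likelihood, and associate components to classes by a simple majority vote over the noisy labels, both of which are my simplifications rather than the paper's exact procedure:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mixture_bayes_classifier(X, y_hat, candidate_K=(1, 2, 3, 4, 5)):
    """Fit a Gaussian mixture, pick the number of components by BIC,
    then associate each component with a class via the noisy labels."""
    best = min(
        (GaussianMixture(n_components=k, random_state=0).fit(X) for k in candidate_K),
        key=lambda gm: gm.bic(X),
    )
    comp = best.predict(X)                      # component responsible for each point
    # Majority vote of noisy labels within each component -> component class
    comp_class = np.array([
        np.bincount(y_hat[comp == j], minlength=2).argmax()
        for j in range(best.n_components)
    ])
    def predict(X_new):
        # Bayes classifier: sum component posteriors belonging to each class
        post = best.predict_proba(X_new)        # P(component | x)
        p1 = post[:, comp_class == 1].sum(axis=1)
        return (p1 > 0.5).astype(int)
    return predict
```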

Discriminative model comparison: logistic regression
Without noisy labels:
  L(w) = Σ_n ŷ_n ln p(ŷ_n=1|x_n, w) + (1 - ŷ_n) ln p(ŷ_n=0|x_n, w)
With noisy labels:
  L(w) = Σ_n ŷ_n ln[ γ_h1 p(y_n=0|x_n, w) + (1 - γ_h1) p(y_n=1|x_n, w) ]
       + (1 - ŷ_n) ln[ (1 - γ_h0) p(y_n=0|x_n, w) + γ_h0 p(y_n=1|x_n, w) ]
with p(y=1|x, w) = 1 / (1 + e^{-w^T x}).
Optimize with multiplicative updates.
Flip-probability table (rows: ŷ, columns: y):
        y=0         y=1
  ŷ=0   1 - γ_h0    γ_h0
  ŷ=1   γ_h1        1 - γ_h1
Bootkrajang, J. & Kaban, A. Label-noise robust logistic regression and applications. 2012.
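A minimal sketch of the noisy-label likelihood above, written as a negative log-likelihood that a generic optimiser can minimise; using scipy.optimize.minimize instead of the paper's multiplicative updates, and holding γ_h0 and γ_h1 fixed, are my simplifications:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # sigmoid

def robust_logreg_nll(w, X, y_hat, gamma_h0, gamma_h1):
    """Negative log-likelihood of label-noise robust logistic regression.
    gamma_h0/gamma_h1 follow the flip table above and are held fixed here;
    the paper also estimates them."""
    p1 = expit(X @ w)                              # p(y=1 | x, w)
    p0 = 1.0 - p1
    p_obs1 = gamma_h1 * p0 + (1 - gamma_h1) * p1   # model probability of observing ŷ=1
    p_obs0 = (1 - gamma_h0) * p0 + gamma_h0 * p1   # model probability of observing ŷ=0
    eps = 1e-12
    return -np.sum(y_hat * np.log(p_obs1 + eps) + (1 - y_hat) * np.log(p_obs0 + eps))

# Example fit with a generic optimiser (the paper uses multiplicative updates /
# conjugate gradients; this is just a convenient stand-in):
# w_hat = minimize(robust_logreg_nll, np.zeros(X.shape[1]),
#                  args=(X, y_hat, 0.2, 0.2)).x
```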

Discriminative model comparison (cont.)
[Results shown for two optimisation strategies: multiplicative updates with the conjugate gradient method, and EM with Newton's method]
Bootkrajang, J. & Kaban, A. Label-noise robust logistic regression and applications. 2012.

Discriminative model comparison Also: extend to multi-class situations; prove convergence; introduce Bayesian regularization term Bootkrajang, J. & Kaban, A. Label-noise robust logistic regression and applications. 2012.

If you want more
- Bootkrajang, J. & Kaban, A. Label-noise robust logistic regression and applications. Machine Learning and Knowledge Discovery. 2012.
- Lawrence, N. & Schölkopf, B. Estimating a Kernel Fisher Discriminant in the Presence of Label Noise. Proceedings of the 18th ICML. 2001.
- Li, Y., et al. Classification in the presence of class noise using a probabilistic Kernel Fisher method. Pattern Recognition. 2007.
- Natarajan, N., et al. Learning with noisy labels. NIPS. 2013.
- Raykar, V., et al. Learning from crowds. Journal of Machine Learning Research. 2010.