Learning from the Wisdom of Crowds by Minimax Entropy. Denny Zhou, John Platt, Sumit Basu and Yi Mao Microsoft Research, Redmond, WA

Similar documents
Aggregating Ordinal Labels from Crowds by Minimax Conditional Entropy. Denny Zhou Qiang Liu John Platt Chris Meek

Uncovering the Latent Structures of Crowd Labeling

arxiv: v2 [cs.lg] 17 Nov 2016

Crowdsourcing via Tensor Augmentation and Completion (TAC)

A Randomized Approach for Crowdsourcing in the Presence of Multiple Views

Budget-Optimal Task Allocation for Reliable Crowdsourcing Systems

Improving Quality of Crowdsourced Labels via Probabilistic Matrix Factorization

Permutation Models meet Dawid-Skene: A Generalised Model for Crowdsourcing

Learning Medical Diagnosis Models from Multiple Experts

On the Impossibility of Convex Inference in Human Computation

arxiv: v3 [stat.ml] 1 Nov 2014

Crowdsourcing label quality: a theoretical analysis

arxiv: v2 [cs.lg] 20 May 2018

ECE521 week 3: 23/26 January 2017

Logistic Regression. COMP 527 Danushka Bollegala

Adaptive Crowdsourcing via EM with Prior

Machine Learning for Signal Processing Bayes Classification

Machine Learning Linear Models

CS 188: Artificial Intelligence. Outline

The Benefits of a Model of Annotation

10-701/ Machine Learning - Midterm Exam, Fall 2010

arxiv: v3 [cs.lg] 25 Aug 2017

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines

Kernel Methods and Support Vector Machines

Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM

Introduction to Support Vector Machines

Truth Discovery and Crowdsourcing Aggregation: A Unified Perspective

ECS289: Scalable Machine Learning

Expectation Maximization, and Learning from Partly Unobserved Data (part 2)

Support Vector Machines

Introduction to Logistic Regression

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

Models of collective inference

1-bit Matrix Completion. PAC-Bayes and Variational Approximation

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1st, 2007

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

ECE521 Lectures 9 Fully Connected Neural Networks

MLE/MAP + Naïve Bayes

MIRA, SVM, k-nn. Lirong Xia

Convex Optimization. Dani Yogatama. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. February 12, 2014

Fantope Regularization in Metric Learning

CS 6375 Machine Learning

Linear & nonlinear classifiers

A Tutorial on Support Vector Machine

The Naïve Bayes Classifier. Machine Learning Fall 2017

Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

ANLP Lecture 22 Lexical Semantics with Dense Vectors

Variational Inference for Crowdsourcing

SMO vs PDCO for SVM: Sequential Minimal Optimization vs Primal-Dual interior method for Convex Objectives for Support Vector Machines

Learning with Noisy Labels. Kate Niehaus Reading group 11-Feb-2014

Series 6, May 14th, 2018 (EM Algorithm and Semi-Supervised Learning)

Computer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization

Crowdsourcing & Optimal Budget Allocation in Crowd Labeling

Bandit-Based Task Assignment for Heterogeneous Crowdsourcing

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

MLE/MAP + Naïve Bayes

Perceptron Revisited: Linear Separators. Support Vector Machines

Factor Modeling for Advertisement Targeting

Multiclass Multilabel Classification with More Classes than Examples

Jointly Clustering Rows and Columns of Binary Matrices: Algorithms and Trade-offs

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata

Clustering. Professor Ameet Talwalkar, CS260 Machine Learning Algorithms, March 8

Binary Classification / Perceptron

9.520 Problem Set 2. Due April 25, 2011

CS6375: Machine Learning Gautam Kunapuli. Support Vector Machines

Classification and Support Vector Machine

Support Vector Machines: Maximum Margin Classifiers

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018

Iterative Learning for Reliable Crowdsourcing Systems

Machine Learning. Regression. Manfred Huber

Convex Optimization Algorithms for Machine Learning in 10 Slides

Crowdsourcing with Multi-Dimensional Trust

Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning

Human Mobility Pattern Prediction Algorithm using Mobile Device Location and Time Data

Scaling Neighbourhood Methods

CS598 Machine Learning in Computational Biology (Lecture 5: Matrix - part 2) Professor Jian Peng Teaching Assistant: Rongda Zhu

ABC-Boost: Adaptive Base Class Boost for Multi-class Classification

Generative v. Discriminative classifiers Intuition

Machine Learning for Computer Vision 8. Neural Networks and Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Support Vector Machines. Machine Learning Spring 2018, March 5 2018. Kasthuri Kannan

UVA CS 4501: Machine Learning

2018 EE448, Big Data Mining, Lecture 4. (Part I) Weinan Zhang Shanghai Jiao Tong University

Machine Learning, Fall 2011: Homework 5

Click Prediction and Preference Ranking of RSS Feeds

Lecture 4 Discriminant Analysis, k-nearest Neighbors

A Minimax Optimal Algorithm for Crowdsourcing

Unsupervised Learning

1 Overview. 2 Learning from Experts. 2.1 Defining a meaningful benchmark. AM 221: Advanced Optimization Spring 2016

CS60021: Scalable Data Mining. Large Scale Machine Learning

CS 188: Artificial Intelligence. Machine Learning

Max Margin-Classifier

CSC 411: Lecture 04: Logistic Regression

An Empirical Study of Building Compact Ensembles

Interpreting Deep Classifiers

The Projected Power Method: An Efficient Algorithm for Joint Alignment from Pairwise Differences

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine Learning for Signal Processing Bayes Classification and Regression

Transcription:

Learning from the Wisdom of Crowds by Minimax Entropy Denny Zhou, John Platt, Sumit Basu and Yi Mao Microsoft Research, Redmond, WA

Outline: 1. Introduction; 2. Minimax entropy principle; 3. Future work and conclusion

1. Introduction

Machine Learning Meets Crowdsourcing. To improve a machine learning model, we can add more training examples, create more meaningful features, or invent more powerful learning algorithms: more and more effort, less and less gain.

Crowdsourcing for Labeling

Low Cost, but also Low Quality. Example image pairs that are easy to confuse: Norfolk Terrier vs. Norwich Terrier, Irish Wolfhound vs. Scottish Deerhound (Stanford Dogs dataset).

Problem Setting and Notation. Workers: i = 1, 2, ..., m. Items: j = 1, 2, ..., n. Categories: k = 1, 2, ..., c. Response tensor Z of size m x n x c:
  z_ijk = 1 if worker i labels item j as category k;
  z_ijk = 0 if worker i labels item j as some other category (not k);
  z_ijk = unknown if worker i does not label item j.
Goal: estimate the ground truth {y_jk}.
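
For concreteness, a small sketch (not from the slides) of building such a response tensor in Python, assuming labels arrive as (worker, item, category) triples with 0-based indices and using a separate mask for the unlabeled entries:

import numpy as np

def build_response_tensor(triples, m, n, c):
    """Build the m x n x c response array Z from (worker, item, category) triples.

    z[i, j, k] = 1 if worker i labeled item j as category k, and 0 otherwise.
    A separate boolean mask marks which (worker, item) pairs were labeled at all,
    standing in for the "unknown" entries described on the slide.
    """
    z = np.zeros((m, n, c))
    labeled = np.zeros((m, n), dtype=bool)
    for i, j, k in triples:
        z[i, j, k] = 1.0
        labeled[i, j] = True
    return z, labeled

# Tiny example: worker 0 labels item 0 as category 1; worker 1 labels item 0 as category 0.
Z, mask = build_response_tensor([(0, 0, 1), (1, 0, 0)], m=2, n=1, c=2)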

Toy Example: Binary Labeling. Problem: what are the true labels of the items?

             Item 1   Item 2   Item 3   Item 4   Item 5   Item 6
  Worker 1      1        2        1        1        1        2
  Worker 2      2        2        1        2        1        1
  Worker 3      1        1        2        1        1        2
  Worker 4      1        1        1        1        1        2
  Worker 5      1        1        1        2        2        2

A Simple Method: Majority Voting (response table above). By majority voting, the true label of item 4 should be class 1: #{workers labeling it as class 1} = 3, #{workers labeling it as class 2} = 2. Possible improvement: more skillful workers should have more weight.
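
A minimal majority-vote sketch over the toy table above (my own illustration; ties would need an explicit tie-breaking rule):

from collections import Counter

# Rows = workers 1-5, columns = items 1-6, entries = chosen class (1 or 2).
labels = [
    [1, 2, 1, 1, 1, 2],
    [2, 2, 1, 2, 1, 1],
    [1, 1, 2, 1, 1, 2],
    [1, 1, 1, 1, 1, 2],
    [1, 1, 1, 2, 2, 2],
]

def majority_vote(labels):
    estimates = []
    for j in range(len(labels[0])):
        votes = Counter(row[j] for row in labels)
        estimates.append(votes.most_common(1)[0][0])
    return estimates

print(majority_vote(labels))  # item 4 (fourth entry) comes out as class 1: 3 votes vs. 2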

Dawid & Skene's Method. Assume that each worker i is associated with a c x c confusion matrix {p^(i)_kl = Prob[z_ij = l | y_j = k]}. For any labeling task, the label given by a worker is generated according to her confusion matrix. Maximum likelihood estimation (MLE): jointly estimate the confusion matrices and the ground truth. Implementation: the EM algorithm.

Probabilistic Confusion Matrices. Assume that the true labels are Class 1 = {item 1, item 2, item 3} and Class 2 = {item 4, item 5, item 6}. Then the empirical confusion matrix of worker 4 in the response table above is (rows: true class, columns: given label):

                 labeled Class 1   labeled Class 2
  true Class 1          1                 0
  true Class 2         2/3               1/3
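
A small sketch (my own illustration) of computing a worker's empirical confusion matrix under these assumed true labels; applied to worker 4's row of the table it reproduces the matrix on the slide:

import numpy as np

labels = np.array([
    [1, 2, 1, 1, 1, 2],
    [2, 2, 1, 2, 1, 1],
    [1, 1, 2, 1, 1, 2],
    [1, 1, 1, 1, 1, 2],
    [1, 1, 1, 2, 2, 2],
])
true_labels = np.array([1, 1, 1, 2, 2, 2])  # assumed true classes of items 1-6

def confusion_matrix(worker_row, true_labels, n_classes=2):
    """Row k, column l: fraction of items with true class k that the worker labeled l."""
    cm = np.zeros((n_classes, n_classes))
    for z, y in zip(worker_row, true_labels):
        cm[y - 1, z - 1] += 1
    return cm / cm.sum(axis=1, keepdims=True)

print(confusion_matrix(labels[3], true_labels))  # worker 4: [[1, 0], [2/3, 1/3]]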

EM in Dawid & Skene's Method. Initialize the ground truth by majority vote, then iterate the following procedure until convergence:
o Estimate the worker confusion matrices using the current estimate of the ground truth.
o Estimate the ground truth using the current estimates of the worker confusion matrices.
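
A compact EM sketch in the spirit of this procedure (a simplified illustration rather than the authors' implementation: it assumes every worker labels every item, keeps soft label estimates, and adds a tiny smoothing constant):

import numpy as np

def dawid_skene_em(labels, n_classes, n_iter=50):
    """labels: (n_workers, n_items) integer array of class indices in {0, ..., n_classes-1}."""
    n_workers, n_items = labels.shape
    # Initialize the soft ground truth by (normalized) majority vote.
    q = np.zeros((n_items, n_classes))
    for j in range(n_items):
        q[j] = np.bincount(labels[:, j], minlength=n_classes)
    q /= q.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # Estimate each worker's confusion matrix from the soft ground truth.
        conf = np.zeros((n_workers, n_classes, n_classes)) + 1e-6  # tiny smoothing
        for i in range(n_workers):
            for j in range(n_items):
                conf[i, :, labels[i, j]] += q[j]
        conf /= conf.sum(axis=2, keepdims=True)
        prior = q.mean(axis=0)

        # Re-estimate the ground truth from the confusion matrices.
        log_q = np.log(prior) + np.zeros((n_items, n_classes))
        for i in range(n_workers):
            for j in range(n_items):
                log_q[j] += np.log(conf[i, :, labels[i, j]])
        q = np.exp(log_q - log_q.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
    return q.argmax(axis=1), conf

On the toy table above (with classes re-indexed to 0 and 1), this is exactly the two-step iteration the slide describes, just with soft label estimates.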

Simplified Dawid & Skene's Method. Each worker i is associated with a single number p_i in [0, 1] such that Prob[z_ij = y_j | worker i] = p_i and Prob[z_ij != y_j | worker i] = 1 - p_i; in other words, each worker behaves like a biased coin.

2. Minimax Entropy Principle

Our Basic Assumption. The observed labels are drawn from unobserved underlying distributions: a separate distribution for each worker-item pair.

Maximum Entropy. To estimate a distribution, it is typical to use the maximum entropy principle (E. T. Jaynes).

Column and Row Matching Constraints

Column Constraints (column matching). For each item, count the # of workers labeling it as class 1 and the # labeling it as class 2 (the columns of the response table above).

Row Constraints (row matching). For each worker, count the # of misclassifications from class 1 to class 2 and from class 2 to class 1 (the rows of the response table above).
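
To make the two kinds of statistics concrete, a small sketch (my own illustration) that computes the per-item label counts and the per-worker confusion counts on the toy table; note that the row statistics use the true labels, which the minimax method itself has to estimate (here we plug in the assumed labels from the confusion-matrix slide):

import numpy as np

labels = np.array([
    [1, 2, 1, 1, 1, 2],
    [2, 2, 1, 2, 1, 1],
    [1, 1, 2, 1, 1, 2],
    [1, 1, 1, 1, 1, 2],
    [1, 1, 1, 2, 2, 2],
])
true_labels = np.array([1, 1, 1, 2, 2, 2])  # assumed, as on the confusion-matrix slide

# Column statistics: for each item, how many workers chose class 1 and class 2.
column_counts = np.stack([(labels == k).sum(axis=0) for k in (1, 2)], axis=1)

# Row statistics: for each worker, counts of (true class -> given label) pairs;
# the off-diagonal entries are the misclassification counts in the row constraints.
row_counts = np.zeros((labels.shape[0], 2, 2), dtype=int)
for i in range(labels.shape[0]):
    for j in range(labels.shape[1]):
        row_counts[i, true_labels[j] - 1, labels[i, j] - 1] += 1

print(column_counts[3])  # item 4: [3, 2]
print(row_counts[3])     # worker 4: [[3, 0], [2, 1]]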

Maximum Entropy. Maximize the entropy of the worker-item label distributions, subject to the column constraints and the row constraints.

To Estimate True Labels, Can We also maximize the entropy over the unknown true labels, subject to the same column and row constraints? No: this leads to a uniform distribution for y_jl, which is uninformative.

Minimax Entropy. Instead: minimize over the true labels the maximum entropy over the label distributions, subject to the column constraints and the row constraints.
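
A hedged reconstruction of this program in symbols, based on the verbal descriptions of the constraints above (my reading, not a verbatim copy of the slide's equations; \pi_{ij}(k) denotes worker i's unobserved label distribution on item j, and y_{jl} indicates that item j's true class is l):

\min_{y} \max_{\pi} \; -\sum_{i,j,k} \pi_{ij}(k) \log \pi_{ij}(k)

\text{s.t.} \quad \sum_{i} \pi_{ij}(k) = \sum_{i} z_{ijk} \quad \forall j, k \quad \text{(column: per-item label counts)}

\qquad \sum_{j} y_{jl}\, \pi_{ij}(k) = \sum_{j} y_{jl}\, z_{ijk} \quad \forall i, k, l \quad \text{(row: per-worker confusion counts)}

\qquad \sum_{k} \pi_{ij}(k) = 1 \quad \forall i, j.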

Justification of Minimum Entropy. Assume the true measurements were available: estimate the label distributions by maximum entropy subject to those true measurements.

Justification of Minimum Entropy. Theorem: minimizing the KL divergence is equivalent to minimizing the entropy.

Lagrangian Dual. The Lagrangian dual can be written in terms of the Lagrange multipliers of the column and row constraints.

Lagrangian Dual. The KKT conditions lead to a closed-form solution for the label distributions, in which Z is the normalization factor.
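
A hedged reconstruction of that closed form (the assignment of \sigma to items and \tau to workers is my reading; the slides themselves only name the dual blocks {\tau, \sigma} on the implementation slide):

\pi_{ij}(k) = \frac{1}{Z_{ij}} \exp\Big( \sigma_{jk} + \sum_{l} y_{jl}\, \tau_{ilk} \Big),
\qquad
Z_{ij} = \sum_{k'} \exp\Big( \sigma_{jk'} + \sum_{l} y_{jl}\, \tau_{ilk'} \Big).

With deterministic labels (y_{jl} in {0, 1}) this reads as exp(\sigma_{jk} + \tau_{i, y_j, k}) / Z_{ij}: an item term plus a worker-confusion term.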

Worker Expertise & Task Confusability. Interpretation of the dual variables: one set captures item confusability, the other set captures worker expertise.

Measurement Objectivity: Item. Objective item confusability: the difference in difficulty between labeling two items should be independent of the chosen workers. Mathematical formulation: the ratio c(i, j, k) / c(i, j', k) comparing two items should be independent of the choice of worker, i.e., it takes the same value for any workers i and i'.

Measurement Objectivity: Worker. Objective worker expertise: the difference in expertise between two workers should be independent of the item being labeled. Mathematical formulation: the ratio c(i, j, k) / c(i', j, k) comparing two workers should be independent of the choice of item, i.e., it takes the same value for any items j and j'.

The Labeling Model Is Objective. Theorem: for deterministic labels, the labeling model given by the closed form above uniquely satisfies the measurement objectivity principle.

Constraint Relaxation (column matching recap). For each item, count the # of workers labeling it as class 1 and the # labeling it as class 2 (response table above).

Constraint Relaxation (row matching recap). For each worker, count the # of misclassifications from class 1 to class 2 and from class 2 to class 1 (response table above).

Constraint Relaxation. Relax the exact moment-matching (column and row) constraints to prevent overfitting.

Implementation. Convert the primal problem to its dual form and apply coordinate descent: split the variables into two blocks, y and {τ, σ}; each subproblem is convex and smooth. Initialize the ground truth by majority vote.
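
A loose, runnable sketch of this alternating scheme (my own simplification, not the authors' solver: it takes a few gradient steps on the observed-label log-likelihood for the {τ, σ} block instead of solving the convex dual subproblem exactly, assumes every worker labels every item, and omits the α, β relaxation):

import numpy as np

def fit_minimax_entropy_sketch(labels, n_classes, n_outer=20, n_grad=100, lr=0.5):
    """Loose alternating-optimization sketch (not the authors' exact solver).

    labels: (n_workers, n_items) integer array of class indices in {0, ..., c-1};
            every worker is assumed to label every item.
    Model:  P(z_ij = k) proportional to exp(sigma[j, k] + sum_l q[j, l] * tau[i, l, k]).
    """
    m, n = labels.shape
    c = n_classes
    onehot = np.eye(c)[labels]                        # (m, n, c) observed labels

    q = onehot.sum(axis=0)                            # soft ground truth, majority-vote init
    q = q / q.sum(axis=1, keepdims=True)              # (n, c)
    sigma = np.zeros((n, c))                          # per-item dual block
    tau = np.zeros((m, c, c))                         # per-worker dual block

    def probs():
        scores = sigma[None, :, :] + np.einsum('jl,ilk->ijk', q, tau)
        scores = scores - scores.max(axis=2, keepdims=True)
        p = np.exp(scores)
        return p / p.sum(axis=2, keepdims=True)       # (m, n, c)

    for _ in range(n_outer):
        # Block 1: gradient-ascent steps on the log-likelihood in (sigma, tau).
        for _ in range(n_grad):
            resid = onehot - probs()
            sigma = sigma + lr * resid.sum(axis=0) / m
            tau = tau + lr * np.einsum('jl,ijk->ilk', q, resid) / n
        # Block 2: re-estimate the (soft) ground truth from the fitted label model.
        loglik = np.zeros((n, c))
        for l in range(c):
            scores = sigma[None, :, :] + tau[:, l, :][:, None, :]   # (m, n, c)
            scores = scores - scores.max(axis=2, keepdims=True)
            logp = scores - np.log(np.exp(scores).sum(axis=2, keepdims=True))
            picked = np.take_along_axis(logp, labels[:, :, None], axis=2)[:, :, 0]
            loglik[:, l] = picked.sum(axis=0)
        q = np.exp(loglik - loglik.max(axis=1, keepdims=True))
        q = q / q.sum(axis=1, keepdims=True)
    return q.argmax(axis=1), sigma, tau

The two blocks correspond to the y and {τ, σ} blocks named on the slide; the actual implementation works with the relaxed dual and solves each convex subproblem exactly rather than taking a fixed number of gradient steps.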

Model Selection. Use k-fold cross-validation to choose (α, β): split the data matrix into k folds, use each fold as a validation set once, and compute the average likelihood over the validation folds. We don't need ground truth for model selection!
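
A sketch of such a cross-validation loop over the observed labels (my own illustration; fit_predict is a hypothetical stand-in for fitting the model at one (α, β) setting and returning predicted label probabilities):

import numpy as np

def cv_score(triples, fit_predict, k=5, seed=0):
    """k-fold cross-validation over observed (worker, item, label) triples.

    fit_predict(train_triples) must return a function p(i, j) giving a probability
    vector over the classes for worker i's label on item j; it is a hypothetical
    stand-in for fitting the model at one (alpha, beta) setting.
    No ground truth is needed: we score held-out *worker labels*.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(triples)), k)
    scores = []
    for fold in folds:
        held_out = set(fold.tolist())
        train = [t for idx, t in enumerate(triples) if idx not in held_out]
        predict = fit_predict(train)
        log_lik = [np.log(predict(i, j)[l])
                   for idx, (i, j, l) in enumerate(triples) if idx in held_out]
        scores.append(float(np.mean(log_lik)))
    return float(np.mean(scores))  # choose the (alpha, beta) with the highest score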

Experiments: Image Labeling. 108 bird images, 2 breeds, 39 workers; each image was labeled by all workers. From: P. Welinder, S. Branson, S. Belongie and P. Perona. The Multidimensional Wisdom of Crowds. NIPS 2010.

Experiments: Image Labeling. Experimental results (accuracy, %):

  Worker Number         10       20       30
  Minimax Entropy       85.18    92.59    93.52
  Dawid & Skene         79.63    87.04    87.96
  Dawid & Skene (S)*    45.37    57.41    75.93
  Majority Voting       67.59    83.33    76.85
  Average Worker        62.78

  * Dawid & Skene (S): the simplified Dawid & Skene method. It is risky to model worker expertise by a single number.

Experiments: Web Search. 177 workers and 2665 <query, URL> pairs; 5 classes: perfect, excellent, good, fair, and bad; each pair was labeled by 6 workers. Accuracy (%):

  Minimax Entropy   88.84
  Dawid & Skene     84.09
  Majority Voting   77.65
  Average Worker    37.05

Comparing with More Methods. Other methods: Raykar et al. (JMLR 2010; beta/Dirichlet priors), Welinder et al. (NIPS 2010; matrix factorization), Karger et al. (NIPS 2011; a BP-like iterative algorithm). From the evaluation in Liu et al. (NIPS 2012): none of them outperforms Dawid & Skene's method, and Karger et al. (NIPS 2011) is even much worse than majority voting.

3. Future Work and Conclusion

Budget-Optimal Crowdsourcing. Assume that we have a budget to get 6 labels. Which item deserves another label, item 2 or item 3? How about a budget of 7 labels, or even more?

            1st round   2nd round
  Item 1        1           1
  Item 2        1          -1
  Item 3        1

Contextual Minimax Entropy. Use contextual information about items and workers. Example: a group of workers labels web pages as spam or non-spam. For each web page: whether its URL ends with .edu, the popularity of its domain, its creation time. For each worker: education degree, reputation history, working experience.

Beyond Labeling. Mobile crowdsourcing platforms, crowdsourcing machine translation, crowdsourcing indoor/outdoor navigation, crowdsourcing design, Wikipedia.

ICML '13 Workshop: Machine Learning Meets Crowdsourcing. http://www.ics.uci.edu/~qliu1/mlcrowd_icml_workshop/

Summary. Proposed the minimax entropy principle for estimating ground truth from noisy labels. Both task confusability and worker expertise are taken into account in our method. Measurement objectivity is implied.

Acknowledgments: Daniel Hsu (MSR New England), Xi Chen (Carnegie Mellon University), Gabriella Kazai (MSR Cambridge), Chris Burges (MSR Redmond), Chris Meek (MSR Redmond).