Aggregating Ordinal Labels from Crowds by Minimax Conditional Entropy. Denny Zhou, Qiang Liu, John Platt, Chris Meek

Aggregating Ordinal Labels from Crowds by Minimax Conditional Entropy. Denny Zhou, Qiang Liu, John Platt, Chris Meek

2

Crowd vs. expert labeling, strengths: saves time, saves money, and yields big labeled data ("more data beats cleverer algorithms") 3

Crowd vs. expert labeling, weakness: garbage in, garbage out; crowdsourced labels may be highly noisy 4

Non-experts, redundant labels: Orange (O) vs. Mandarin (M). Each fruit image receives labels from several non-expert workers, and their O/M labels disagree (figure: grid of conflicting worker labels) 5


Notation: x_ij is the observed label that worker i assigns to item j (rows of the label matrix are workers 1, 2, ..., i; columns are items 1, 2, ..., j, so row i reads x_i1, x_i2, ..., x_ij). The true labels y_j are unobserved 7
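
To make the setup concrete, here is a tiny sketch of how the observed labels can be held in code (Python; the sparse dict keyed by (worker, item) and the toy values are my own illustration, not anything prescribed by the talk).

    # Toy crowdsourcing data: each worker labels only some items, so a sparse
    # dict keyed by (worker, item) is convenient. Labels are on an ordinal 1..5 scale.
    observed_labels = {
        (0, 0): 4, (0, 1): 2, (0, 2): 5,   # worker 0 labeled items 0, 1, 2
        (1, 0): 3, (1, 2): 5,              # worker 1 labeled items 0 and 2
        (2, 1): 2, (2, 2): 4,              # worker 2 labeled items 1 and 2
    }
    num_workers, num_items, num_classes = 3, 3, 5
    # The true labels y_j are never observed; the goal is to infer them (or a
    # distribution over them) from the redundant, noisy worker labels above.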

Roadmap: from multiclass to ordinal 1. Develop a method to aggregate general multiclass labels 2. Adapt the general method to ordinal labels 8

Examples of multiclass labeling: image categorization, speech recognition 9

Introduce two fundamental concepts: the empirical count of wrong/correct labels and the expected count of wrong/correct labels, defined in terms of P, the worker label distribution, and Q, the true label distribution 10
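
As a sketch of these two quantities (Python/NumPy; the function names and the soft weighting by a current belief q over true labels are my own illustration), the diagonal entries of each matrix count correct labels and the off-diagonal entries count wrong ones:

    import numpy as np

    def empirical_counts(observed_labels, q, worker, num_classes):
        """Empirical (true label c, worker label k) counts for one worker, weighted
        by q[j, c], the current belief that item j's true label is c (labels 1-based)."""
        counts = np.zeros((num_classes, num_classes))
        for (i, j), k in observed_labels.items():
            if i == worker:
                counts[:, k - 1] += q[j]
        return counts

    def expected_counts(observed_labels, q, p_worker, worker, num_classes):
        """Expected (c, k) counts for the same worker, where p_worker[c, k] is the
        modeled probability that this worker answers k when the true label is c."""
        counts = np.zeros((num_classes, num_classes))
        for (i, j), _ in observed_labels.items():
            if i == worker:
                counts += q[j][:, None] * p_worker
        return counts

The worker constraints on the following slides ask, roughly, that these two matrices agree for every worker, and the item constraints ask the analogous thing per item.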

Multiclass maximum conditional entropy: given the true labels, estimate the worker label distribution P by maximizing the conditional entropy of worker labels given true labels, subject to worker constraints and item constraints (expected counts must match empirical counts per worker and per item) 11

Multiclass minimax conditional entropy: jointly estimate the true label distribution Q and the worker label distribution P by minimizing over Q the maximum-entropy value over P, subject to the same worker constraints and item constraints 12

Lagrangian of the constrained problem: attach dual variables (Lagrange multipliers) to the worker constraints and the item constraints 13

Probabilistic labeling model: by optimization (duality) theory, the dual problem leads to P(x_ij = k | y_j = c) proportional to exp(σ_i(c, k) + τ_j(c, k)), up to a normalization factor, where σ_i is a worker ability (confusion) matrix and τ_j is an item difficulty matrix 14
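
A minimal sketch of such a model (Python/NumPy; the parameter names sigma and tau and the toy values are assumptions of this sketch):

    import numpy as np

    def label_distribution(sigma_i, tau_j):
        """P(worker label = k | true label = c) for one worker i and item j:
        proportional to exp(sigma_i[c, k] + tau_j[c, k]), normalized over k."""
        scores = sigma_i + tau_j                      # shape (num_classes, num_classes)
        scores -= scores.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(scores)
        return probs / probs.sum(axis=1, keepdims=True)

    # Toy example: 3 classes, a worker who mostly answers correctly, a neutral item.
    sigma_i = 2.0 * np.eye(3)     # worker ability: mass on the diagonal (correct answers)
    tau_j = np.zeros((3, 3))      # item difficulty: neutral
    print(label_distribution(sigma_i, tau_j))

With the diagonal-heavy sigma above the worker tends to report the true label; a less neutral tau would let a hard item pull probability mass off the diagonal.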

Dual problem: 1. this only generates deterministic labels; 2. it is equivalent to maximizing the complete likelihood 15

Roadmap: from multiclass to ordinal 1. Develop a method to aggregate general multiclass labels 2. Adapt the general method to ordinal labels 16

An example of ordinal labeling: rating search results on an ordinal scale (Perfect 1, Excellent 2, Good 3, Fair 4, Bad 5) 17

To proceed to ordinal labels: formulate assumptions that are specific to ordinal labeling and that coincide with the previous multiclass method in the case of binary labeling 18

Our assumption for ordinal labeling: adjacency confusability. On an ordinal scale 1 2 3 4 5, workers are likely to confuse adjacent labels and unlikely to confuse labels that are far apart 19

Formulating this assumption through pairwise comparison: pick a reference label and ask, for both the true label and the worker label, whether it lies below the reference label or not (< vs. ≥); this indirect comparison links the worker label to the true label 20

Ordinal minimax conditional entropy: jointly estimate Q and P by minimax conditional entropy, subject to worker constraints and item constraints stated through the relations Δ and ∇, each of which takes on the value < or ≥ 21

Ordinal minimax conditional entropy: in the worker constraints and item constraints, the true label and the worker label are each compared against a reference label 22

Ordinal minimax conditional entropy: the difference from the multiclass formulation is that the constraints compare the true label and the worker label to a reference label rather than counting every (true label, worker label) pair separately 23

Explaining the ordinal constraints: for example, let Δ be < and ∇ be ≥; the constraint then counts how often the true label lies below the reference label while the worker label lies at or above it, i.e., it counts mistakes in an ordinal sense 24
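
As a sketch of such a count (Python; the function name, the soft weighting by q, and the string encoding of the relations are mine), for a reference label s and relations Δ, ∇ drawn from {<, ≥}:

    import operator
    import numpy as np

    RELATIONS = {"<": operator.lt, ">=": operator.ge}

    def ordinal_statistic(observed_labels, q, worker, s, delta, nabla, num_classes):
        """Soft count of items labeled by `worker` whose true label c satisfies
        (c delta s) while the worker's label k satisfies (k nabla s).
        q[j] is the current distribution over item j's true label (1-based classes)."""
        rel_true, rel_obs = RELATIONS[delta], RELATIONS[nabla]
        total = 0.0
        for (i, j), k in observed_labels.items():
            if i == worker and rel_obs(k, s):
                total += sum(q[j][c - 1] for c in range(1, num_classes + 1) if rel_true(c, s))
        return total

    # E.g. delta="<", nabla=">=" softly counts how often this worker rates an item at or
    # above the reference label s when its true label is below s: mistakes in an ordinal sense.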

Probabilistic rating model: by the KKT conditions, the dual problem leads to the same exponential-family form, but the worker ability and item difficulty matrices are now structured, built from the reference-label comparisons 25
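
One way to read "structured" is sketched below (Python/NumPy; this parameterization, with one parameter per reference label and relation pair, is my paraphrase of the idea rather than the talk's exact equations): the (true label c, worker label k) entry sums the reference-label terms switched on by the comparisons c Δ s and k ∇ s.

    import operator
    import numpy as np

    REL = {"<": operator.lt, ">=": operator.ge}

    def structured_ability(ref_params, num_classes):
        """Assemble a worker ability matrix sigma[c, k] from per-reference-label
        parameters ref_params[s][(delta, nabla)], with delta/nabla in {"<", ">="}:
        sigma[c, k] adds ref_params[s][(delta, nabla)] whenever (c delta s) and
        (k nabla s) both hold (labels are 1-based)."""
        sigma = np.zeros((num_classes, num_classes))
        for c in range(1, num_classes + 1):
            for k in range(1, num_classes + 1):
                for s, params in ref_params.items():
                    for (delta, nabla), value in params.items():
                        if REL[delta](c, s) and REL[nabla](k, s):
                            sigma[c - 1, k - 1] += value
        return sigma

Because two adjacent labels c and c + 1 disagree on the comparison with only one reference label (s = c + 1), their rows of sigma share almost all of their terms, which is how adjacency confusability enters the model.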

Regularization. Two goals: 1. prevent overfitting; 2. fix the deterministic-label issue so that the method generates probabilistic labels 26

Regularized minimax conditional entropy: jointly estimate Q and P by minimax conditional entropy plus regularization terms, subject to worker constraints and item constraints 27

Regularized minimax conditional entropy: equivalently, jointly estimate Q and P with the worker constraints and item constraints relaxed, allowing slack that is penalized in the objective 28

Dual problem: 1. this generates probabilistic labels; 2. it is equivalent to maximizing the marginal likelihood 29
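
As a rough companion to this likelihood view, here is a generic Dawid-Skene-style sketch (Python/NumPy) of how a confusion-matrix model turns worker labels into probabilistic labels; it is not the talk's coordinate-update algorithm, just the standard posterior computation such models share:

    import numpy as np

    def label_posterior(item, observed_labels, confusions, prior):
        """Posterior over an item's true label under a confusion-matrix model.
        confusions[i][c, k] = P(worker i answers k+1 | true label is c+1),
        prior[c] = P(true label is c+1). Returns a normalized vector over classes."""
        log_post = np.log(prior)
        for (i, j), k in observed_labels.items():
            if j == item:
                log_post += np.log(confusions[i][:, k - 1])
        log_post -= log_post.max()                 # numerical stability
        post = np.exp(log_post)
        return post / post.sum()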

Choosing regularization parameters: cross-validation with 5 or 10 folds (random splits), comparing the held-out likelihood of worker labels across parameter settings. No ground truth labels are needed for cross-validation! 30
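
A minimal sketch of that selection loop (Python; fit_model and heldout_loglik are hypothetical callables passed in for fitting the regularized model and scoring held-out worker labels, since the talk does not spell out an interface): split the observed worker labels rather than the items, and keep the regularization setting with the best held-out label likelihood.

    import random

    def choose_regularization(observed_labels, candidate_params, fit_model,
                              heldout_loglik, num_folds=5, seed=0):
        """Pick the regularization setting whose fitted model gives the highest
        held-out worker-label log-likelihood. fit_model(train_labels, param) and
        heldout_loglik(model, test_labels) are user-supplied (hypothetical) callables."""
        entries = list(observed_labels.items())
        random.Random(seed).shuffle(entries)
        folds = [entries[f::num_folds] for f in range(num_folds)]
        best_param, best_score = None, float("-inf")
        for param in candidate_params:
            score = 0.0
            for f in range(num_folds):
                train = dict(e for g, fold in enumerate(folds) if g != f for e in fold)
                model = fit_model(train, param)
                score += heldout_loglik(model, dict(folds[f]))
            if score > best_score:
                best_param, best_score = param, score
        return best_param

Note that no ground truth enters anywhere: the score is the likelihood of held-out worker labels.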

Experiments: evaluation metrics. L0 error counts mislabeled items, L1 error measures the absolute deviation of the estimated label from the true label, and L2 error the corresponding quadratic deviation, averaged over items 31
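
For concreteness, a small sketch of these metrics on point predictions (Python/NumPy; the exact form used in the talk may differ, e.g. whether L2 is reported as mean squared or root mean squared error, so treat this as an assumption):

    import numpy as np

    def error_metrics(predicted, truth):
        """L0, L1, L2 errors of predicted ordinal labels against the true labels.
        Here L2 is taken to be the root-mean-square deviation (an assumption)."""
        predicted, truth = np.asarray(predicted, float), np.asarray(truth, float)
        diff = np.abs(predicted - truth)
        return {
            "L0": float(np.mean(diff > 0)),            # fraction of mislabeled items
            "L1": float(np.mean(diff)),                # mean absolute deviation
            "L2": float(np.sqrt(np.mean(diff ** 2))),  # RMS deviation (assumed form)
        }

    print(error_metrics([1, 3, 5, 2], [1, 4, 5, 4]))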

Experiments: baselines. Compare regularized minimax conditional entropy to majority voting, the Dawid-Skene method (1979; see also its Bayesian versions in Raykar et al. 2010, Liu et al. 2012, Chen et al. 2013), and latent trait analysis (Andrich 1978, Masters 1982, Uebersax and Grove 1993, Mineiro 2011) 32
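
The simplest baseline, majority voting, fits in a few lines (Python; tie-breaking toward the smallest label is an arbitrary choice of this sketch):

    from collections import Counter, defaultdict

    def majority_vote(observed_labels):
        """Return, for each item, the label chosen most often by its workers.
        Ties are broken toward the smallest label (an arbitrary choice)."""
        votes = defaultdict(Counter)
        for (worker, item), label in observed_labels.items():
            votes[item][label] += 1
        return {
            item: sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]
            for item, counter in votes.items()
        }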

Web search data: relevance of search results rated on an ordinal scale (Perfect 1, Excellent 2, Good 3, Fair 4, Bad 5) 33

Web search data. Some facts about the data: 2665 query-URL pairs with a relevance rating scale from 1 to 5; 177 non-expert workers with an average error rate of 63%; each query-URL pair is judged by 6 workers; true labels are created via consensus among 9 experts; dataset created by Gabriella Kazai of Microsoft 34

Web search data, error rates 35
Method               L0 error   L1 error   L2 error
Majority vote        0.269      0.428      0.930
Dawid & Skene        0.170      0.205      0.539
Latent trait         0.201      0.211      0.481
Entropy multiclass   0.111      0.131      0.419
Entropy ordinal      0.104      0.118      0.384

Probabilistic labels vs error rates: bar chart of L0, L1, and L2 error rates grouped by the estimated probability of the aggregated label, in bins (0, 0.5), (0.5, 0.6), (0.6, 0.7), (0.7, 0.8), (0.8, 0.9), (0.9, 1) 36

Price prediction data: prices are mapped to a 7-point ordinal scale ($0-$50 = 1, $51-$100 = 2, $101-$250 = 3, $251-$500 = 4, $501-$1000 = 5, $1001-$2000 = 6, $2001-$5000 = 7) 37
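
A direct transcription of this scale into code (Python; the bucket boundaries are exactly those above, the function name is mine):

    def price_bucket(price):
        """Map a dollar price to the 7-point ordinal scale used in the price data."""
        upper_bounds = [50, 100, 250, 500, 1000, 2000, 5000]  # inclusive upper edges
        for bucket, bound in enumerate(upper_bounds, start=1):
            if price <= bound:
                return bucket
        raise ValueError("price above the top bucket ($2001-$5000)")

    print(price_bucket(75))    # -> 2
    print(price_bucket(1500))  # -> 6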

Price prediction data. Some facts about the data: 80 household items collected from stores like Amazon and Costco; prices predicted by 155 students at UC Irvine; average error rate of 69%, and systematically biased; dataset created by Mark Steyvers of UC Irvine 38

Price prediction data, error rates 39
Method               L0 error   L1 error   L2 error
Majority vote        0.675      1.125      1.605
Dawid & Skene        0.650      1.050      1.517
Latent trait         0.688      1.063      1.504
Entropy multiclass   0.675      1.150      1.643
Entropy ordinal      0.613      0.975      1.492

Summary: the minimax conditional entropy principle for crowdsourcing; the adjacency confusability assumption in ordinal labeling; an ordinal labeling model with structured confusion matrices. http://research.microsoft.com/en-us/projects/crowd/ 40