ONLINE APPENDICES for Cost-Effective Quality Assurance in Crowd Labeling

Jing Wang
School of Business and Management, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, jwang@ust.hk

Panagiotis G. Ipeirotis, Foster Provost
Leonard N. Stern School of Business, New York University, New York, NY 10012, {panos, fprovost}@stern.nyu.edu

Article submitted to Information Systems Research; manuscript no. ISR-24-2

Appendix A: Importance of Quality Control for Binary Choice Questions

Our scheme can be directly applied to binary choice questions, which already capture a large number of tasks that are crowdsourced today (e.g., sentiment judgment, spam detection). We would like to stress, though, that quality control mechanisms for binary choice questions are at the heart of many other, more complex tasks that are also executed on crowdsourcing platforms. Below we give some representative examples.

• Open-ended questions with correct or incorrect answers: Consider the task of collecting information about a given topic; for example, "collect URLs that discuss massive online education courses and their impact on MBA programs." For this type of task, it is usually difficult or infeasible to enumerate all the correct answers; therefore, it is not possible to control the quality of the task using the quality control mechanism for binary choice answers directly. However, once an answer is provided, we can easily check its correctness by instantiating another task that asks a binary choice question: "Is this submitted URL about massive online education courses and their impact on MBA programs?"
Thereby, one can break the task into two subtasks: a Create task, in which one or more workers submit free-form answers, and a Verify task, in which another set of workers vet the submitted answers and classify them as either "correct" or "incorrect." Figure A.1(a) illustrates the structure: a Verify task controls the quality of a Create task; the quality of the Verify task is then controlled using a quality control mechanism for binary choice questions, similar to the one presented in this paper.

• Varying degrees of correctness: There are some tasks whose free-form answers are not right or wrong but have varying degrees of correctness or goodness (e.g., "generate a transcript from this manuscript," "describe and explain the image below in at least three sentences"). In such a setting, treating the submitted answers as correct or incorrect might be inefficient: a rejected answer would be completely discarded, although it is often possible to leverage low-quality answers to get better results by simply iterating. Past work (Little et al. 2010) has shown the superiority of the iterative paradigm by demonstrating that workers are able to create image descriptions of excellent quality even though no single worker puts in any significant effort. Figure A.1(b) illustrates the iterative process. There are four subtasks: a Create task, in which free-form answers are submitted; an Improve task, in which workers are asked to improve an existing answer; a Compare task, in which workers are required to compare two answers and select the better one; and a Verify task, in which workers decide whether the quality of the answers is satisfactory. (The Verify task either accepts input directly from the Create task or gets the better answer returned by the Compare task.) In this case, the Compare task and the Verify task are binary choice tasks, and one can use the mechanisms presented in this paper to control the quality of the submitted answers and of the participating workers. In turn, the quality of the Create task and the Improve task is controlled by the Verify and Compare tasks, as one can measure the probability that a worker submits an answer of high quality or the probability that a worker is able to improve an existing answer.

• Complex tasks using workflows: Initial applications of paid micro-crowdsourcing focused primarily on simple and routine tasks. However, many tasks in our daily life are much more complicated (e.g., "proofread the following paragraph from the draft of a student's essay," "write a travel guide about New York City"), and recently there is an increasing trend to accomplish such tasks by dividing complex tasks into a set of micro-tasks using workflows. For example, Bernstein et al. (2010) introduce the "Find-Fix-Verify" pattern to split text editing tasks into three simple operations: find something that needs fixing, fix the problem if there is one, and verify the correctness of the fix. Again, this task ends up having quality control through a set of binary choice tasks (verification of the fix, verification that something needs fixing). In other cases, Kittur et al. (2011) describe a framework for parallelizing the execution of such workflows, and Kulkarni et al. (2011) move a step further by allowing workers themselves to design the workflow. As in the case of other tasks that are broken into workflows of micro-tasks, the quality of these complex tasks can be guaranteed by applying our quality control scheme to each single micro-task, following the paradigms described above.

[Figure A.1: Workflows for two types of tasks. (a) Correct or incorrect answers: START, Create Task, Verify Task ("Answer is correct?"; Yes: Accept, No: Reject), END. (b) Varying degrees of correctness: START, Create Task, Verify Task ("Answer is good?"; Yes: Accept, then END; No: Improve Task, then Compare Task, whose better answer returns to the Verify Task).]
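The iterative workflow for answers with varying degrees of correctness can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the plain majority-vote aggregation of the binary Verify and Compare answers, and the `max_rounds` cap are all assumptions made for the example.

```python
from typing import Callable, List, Tuple


def majority(votes: List[bool]) -> bool:
    """True when strictly more than half of the binary votes are True."""
    return 2 * sum(votes) > len(votes)


def run_iterative_workflow(
    create: Callable[[], str],
    improve: Callable[[str], str],
    verify_workers: List[Callable[[str], bool]],
    compare_workers: List[Callable[[str, str], bool]],
    max_rounds: int = 5,  # hypothetical cap so the loop always terminates
) -> Tuple[str, bool]:
    """Create, then loop Verify -> Improve -> Compare.

    Returns (answer, accepted). accepted is False when no answer passed
    the Verify task within max_rounds Verify checks.
    """
    answer = create()
    for _ in range(max_rounds):
        # Verify task: binary question "is this answer good?"
        if majority([w(answer) for w in verify_workers]):
            return answer, True
        # Improve task produces a candidate; the Compare task is a binary
        # question "is the candidate better than the current answer?"
        candidate = improve(answer)
        if majority([w(candidate, answer) for w in compare_workers]):
            answer = candidate
    return answer, False
```

Because Verify and Compare are both binary choice tasks, the majority vote used here is exactly the place where a more careful binary quality control mechanism could be substituted.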

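Appendix B: Full Derivation

In our model, the set of labels $L = \{l^k_o\}$ is known, while the quality of each worker $k$, the easiness of each object $o$, and the true class of each object are unknown and have to be estimated from the set of given labels. Following Whitehill et al. (2009), we use an expectation-maximization approach to obtain the maximum likelihood estimates of $\alpha^k$, $\beta_o$, and $t_o$ for each worker $k$ and each object $o$. Below, $K_o$ denotes the set of workers who labeled object $o$ and $L_o$ the labels assigned to object $o$.

E-step: The posterior probability of $t_o$ given $\{\alpha^k\}$ and $\{\beta_o\}$ is characterized by:

$$
\begin{aligned}
p(t_o \mid L, \{\alpha^k\}, \{\beta_o\})
&= p(t_o \mid L_o, \{\alpha^k : k \in K_o\}, \beta_o) \\
&\propto p(t_o \mid \{\alpha^k : k \in K_o\}, \beta_o)\; p(L_o \mid t_o, \{\alpha^k : k \in K_o\}, \beta_o) \\
&\propto p(t_o) \prod_{k \in K_o} p(l^k_o \mid t_o, \alpha^k, \beta_o)
\quad \text{(since the $l^k_o$ are conditionally independent given $t_o$, $\{\alpha^k\}$, and $\beta_o$)} \\
&\propto p(t_o) \prod_{k \in K_o}
\left( \frac{1}{1 + e^{-\alpha^k_{t_o} \beta_o}} \right)^{I(l^k_o = t_o)}
\left( 1 - \frac{1}{1 + e^{-\alpha^k_{t_o} \beta_o}} \right)^{I(l^k_o \neq t_o)}
\qquad (1)
\end{aligned}
$$

Following equation (1), we can calculate the posterior probability of $t_o$ using the prior probability of $t_o$, the values of $\{\alpha^k : k \in K_o\}$, and the value of $\beta_o$ estimated from the previous M-step.

M-step: We maximize the auxiliary function $Q$, which is defined as the expectation of the joint log-likelihood of the observed and hidden variables $(L, \{t_o\})$ given the parameters $(\{\alpha^k\}, \{\beta_o\})$, where the values of the hidden variables $\{t_o\}$ are computed during the previous E-step. We can also impose a prior on each parameter. The prior probabilities of $\alpha^k_0$, $\alpha^k_1$, and $\beta_o$ are denoted as $p(\alpha^k_0)$, $p(\alpha^k_1)$, and $p(\beta_o)$, respectively.

$$
\begin{aligned}
Q(\{\alpha^k\}, \{\beta_o\})
&= E\Big[ \ln \big( p(L, \{t_o\} \mid \{\alpha^k\}, \{\beta_o\})\; p(\{\alpha^k\}, \{\beta_o\}) \big) \Big] \\
&= E\Big[ \ln \prod_o \Big( p(t_o) \prod_{k \in K_o} p(l^k_o \mid t_o, \alpha^k, \beta_o) \Big) \Big]
 + \sum_k \sum_{i=0}^{1} \ln p(\alpha^k_i) + \sum_o \ln p(\beta_o) \\
&= \sum_o E[\ln p(t_o)] + \sum_o \sum_{k \in K_o} E\big[ \ln p(l^k_o \mid t_o, \alpha^k_{t_o}, \beta_o) \big]
 + \sum_k \sum_{i=0}^{1} \ln p(\alpha^k_i) + \sum_o \ln p(\beta_o),
\end{aligned}
$$

where the expectation is taken over $\{t_o\}$ estimated during the previous E-step. The values of $\{\alpha^k\}$ and $\{\beta_o\}$ are obtained by maximizing the auxiliary function $Q$. This is not directly solvable; therefore, we apply a gradient ascent approach to find parameter values that locally maximize $Q$. Let us define $p_{oi} = p(t_o = i)$ as estimated from the previous E-step; then

$$
Q(\{\alpha^k\}, \{\beta_o\})
= \sum_o \sum_{i=0}^{1} p_{oi} \ln p(t_o = i)
+ \sum_o \sum_{k \in K_o} \sum_{i=0}^{1} p_{oi} \ln p(l^k_o \mid t_o = i, \alpha^k_i, \beta_o)
+ \sum_k \sum_{i=0}^{1} \ln p(\alpha^k_i) + \sum_o \ln p(\beta_o).
\qquad (2)
$$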
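The E-step posterior for a single object can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function and argument names are hypothetical, each worker's quality is passed as a pair (alpha_0, alpha_1), and the class prior is a single number `prior1`.

```python
import math
from typing import Dict, Tuple


def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def e_step_posterior(
    labels: Dict[str, int],                  # worker id -> label in {0, 1}
    alpha: Dict[str, Tuple[float, float]],   # worker id -> (alpha_0, alpha_1)
    beta: float,                             # easiness of this object
    prior1: float = 0.5,                     # assumed prior P(t = 1)
) -> Tuple[float, float]:
    """Return (P(t=0 | labels), P(t=1 | labels)) for one object."""
    weights = []
    for t, prior in ((0, 1.0 - prior1), (1, prior1)):
        w = prior
        for k, label in labels.items():
            # Probability that worker k labels a class-t object correctly.
            correct = sigmoid(alpha[k][t] * beta)
            w *= correct if label == t else 1.0 - correct
        weights.append(w)
    z = weights[0] + weights[1]  # normalizing constant
    return weights[0] / z, weights[1] / z
```

For long label lists the product can underflow; summing logarithms instead is the usual remedy, omitted here to keep the sketch close to the product form of the posterior.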

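Based on equations (1) and (2), we have:

$$
p(l^k_o \mid t_o = 0, \alpha^k_0, \beta_o) = \sigma(\alpha^k_0 \beta_o)^{1 - l^k_o} \big( 1 - \sigma(\alpha^k_0 \beta_o) \big)^{l^k_o}
$$

and

$$
p(l^k_o \mid t_o = 1, \alpha^k_1, \beta_o) = \sigma(\alpha^k_1 \beta_o)^{l^k_o} \big( 1 - \sigma(\alpha^k_1 \beta_o) \big)^{1 - l^k_o},
$$

where $\sigma(x) = 1 / (1 + e^{-x})$ is the logistic function. Then, with $p_{oi}$ denoting the estimate of $p(t_o = i)$ from the previous E-step and $K_o$ the set of workers who labeled object $o$,

$$
\begin{aligned}
Q(\{\alpha^k\}, \{\beta_o\})
&= \sum_o \big[ p_{o0} \ln p(t_o = 0) + p_{o1} \ln p(t_o = 1) \big] \\
&\quad + \sum_o \sum_{k \in K_o} p_{o0} \Big[ (1 - l^k_o) \ln \sigma(\alpha^k_0 \beta_o) + l^k_o \ln \big( 1 - \sigma(\alpha^k_0 \beta_o) \big) \Big] \\
&\quad + \sum_o \sum_{k \in K_o} p_{o1} \Big[ l^k_o \ln \sigma(\alpha^k_1 \beta_o) + (1 - l^k_o) \ln \big( 1 - \sigma(\alpha^k_1 \beta_o) \big) \Big] \\
&\quad + \sum_k \big[ \ln p(\alpha^k_0) + \ln p(\alpha^k_1) \big] + \sum_o \ln p(\beta_o).
\end{aligned}
$$

Using the fact that

$$
\frac{d}{dx} \ln \sigma(x) = 1 - \sigma(x), \qquad \frac{d}{dx} \ln \big( 1 - \sigma(x) \big) = -\sigma(x),
$$

we differentiate the function $Q$ with respect to $\{\alpha^k\}$ and $\{\beta_o\}$, where $O^k$ denotes the set of objects labeled by worker $k$:

$$
\begin{aligned}
\frac{\partial Q}{\partial \alpha^k_0}
&= \sum_{o \in O^k} p_{o0} \Big[ (1 - l^k_o) \big( 1 - \sigma(\alpha^k_0 \beta_o) \big) \beta_o - l^k_o\, \sigma(\alpha^k_0 \beta_o)\, \beta_o \Big] + \frac{d \ln p(\alpha^k_0)}{d \alpha^k_0} \\
&= \sum_{o \in O^k} p_{o0} \big( 1 - l^k_o - \sigma(\alpha^k_0 \beta_o) \big) \beta_o + \frac{d \ln p(\alpha^k_0)}{d \alpha^k_0}, \\
\frac{\partial Q}{\partial \alpha^k_1}
&= \sum_{o \in O^k} p_{o1} \Big[ l^k_o \big( 1 - \sigma(\alpha^k_1 \beta_o) \big) \beta_o - (1 - l^k_o)\, \sigma(\alpha^k_1 \beta_o)\, \beta_o \Big] + \frac{d \ln p(\alpha^k_1)}{d \alpha^k_1} \\
&= \sum_{o \in O^k} p_{o1} \big( l^k_o - \sigma(\alpha^k_1 \beta_o) \big) \beta_o + \frac{d \ln p(\alpha^k_1)}{d \alpha^k_1}, \\
\frac{\partial Q}{\partial \beta_o}
&= \sum_{k \in K_o} \Big[ p_{o0} \big( 1 - l^k_o - \sigma(\alpha^k_0 \beta_o) \big) \alpha^k_0 + p_{o1} \big( l^k_o - \sigma(\alpha^k_1 \beta_o) \big) \alpha^k_1 \Big] + \frac{d \ln p(\beta_o)}{d \beta_o}.
\end{aligned}
$$

To find locally optimal values of $\{\alpha^k\}$ and $\{\beta_o\}$, we set the gradient to zero. The resulting equations are nonlinear, and we use iterative methods to solve them. Using gradient ascent, we take steps proportional to the positive of the gradient and eventually approach a local maximum of the function.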
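A gradient ascent step on the data term of Q can be sketched as below. This is an illustrative sketch only: it assumes flat priors (so the prior-derivative terms vanish), a fixed hypothetical step size, and dictionary-based data structures chosen for readability; none of these choices come from the paper.

```python
import math
from typing import Dict, List, Tuple

Labels = Dict[Tuple[str, str], int]       # (worker, object) -> label in {0, 1}
Posterior = Dict[str, Tuple[float, float]]  # object -> (P(t=0), P(t=1))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def q_data_term(post: Posterior, labels: Labels,
                alpha: Dict[str, List[float]], beta: Dict[str, float]) -> float:
    """Expected log-likelihood of the labels (Q without prior terms)."""
    total = 0.0
    for (k, o), l in labels.items():
        p0, p1 = post[o]
        s0 = sigmoid(alpha[k][0] * beta[o])
        s1 = sigmoid(alpha[k][1] * beta[o])
        total += p0 * ((1 - l) * math.log(s0) + l * math.log(1 - s0))
        total += p1 * (l * math.log(s1) + (1 - l) * math.log(1 - s1))
    return total


def q_gradient(post: Posterior, labels: Labels,
               alpha: Dict[str, List[float]], beta: Dict[str, float]):
    """Gradient of the data term with respect to every alpha and beta."""
    g_alpha = {k: [0.0, 0.0] for k in alpha}
    g_beta = {o: 0.0 for o in beta}
    for (k, o), l in labels.items():
        p0, p1 = post[o]
        r0 = p0 * (1 - l - sigmoid(alpha[k][0] * beta[o]))
        r1 = p1 * (l - sigmoid(alpha[k][1] * beta[o]))
        g_alpha[k][0] += r0 * beta[o]
        g_alpha[k][1] += r1 * beta[o]
        g_beta[o] += r0 * alpha[k][0] + r1 * alpha[k][1]
    return g_alpha, g_beta


def ascent_step(post, labels, alpha, beta, step: float = 0.01):
    """One gradient ascent update; step is a hypothetical fixed learning rate."""
    g_alpha, g_beta = q_gradient(post, labels, alpha, beta)
    new_alpha = {k: [a + step * g for a, g in zip(alpha[k], g_alpha[k])] for k in alpha}
    new_beta = {o: beta[o] + step * g_beta[o] for o in beta}
    return new_alpha, new_beta
```

In a full EM loop, such steps would be repeated inside each M-step until the gradient is close to zero, after which the E-step recomputes the posteriors.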

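Appendix C: Proofs

C.1 Proof of Proposition 3

Proof. The estimated misclassification cost of object $o$ at step $m$ is

$$
\mathrm{EstCost}^m_o = \min_{j \in \{0,1\}} \sum_{i=0}^{1} p^m_{oi}\, c_{ij} = \min\{ c_{10}\, p^m_{o1},\; c_{01}\, p^m_{o0} \},
$$

where $p^m_{oi}$ is the class probability estimate at step $m$ and correct classifications carry zero cost. Worker $k$ assigns to object $o$ a label 0 with probability $p(l^k_o = 0) = p^m_{o0}\, \xi^{ko}_{00} + p^m_{o1} (1 - \xi^{ko}_{11})$ and a label 1 with probability $p(l^k_o = 1) = p^m_{o0} (1 - \xi^{ko}_{00}) + p^m_{o1}\, \xi^{ko}_{11}$, where

$$
\xi^{ko}_{00} = \frac{1}{1 + e^{-\alpha^k_0 \beta_o}}, \qquad \xi^{ko}_{11} = \frac{1}{1 + e^{-\alpha^k_1 \beta_o}}.
$$

If $l^k_o = 0$, the new class probability estimate for object $o$ is

$$
p^{m+1}_{o0} = \frac{p^m_{o0}\, \xi^{ko}_{00}}{p^m_{o0}\, \xi^{ko}_{00} + p^m_{o1} (1 - \xi^{ko}_{11})},
$$

and the associated estimated misclassification cost is

$$
\mathrm{EstCost}^{m+1}_{o \mid l^k_o = 0} = \min\{ c_{10}\, p^{m+1}_{o1},\; c_{01}\, p^{m+1}_{o0} \}
= \min\left\{ \frac{c_{10}\, p^m_{o1} (1 - \xi^{ko}_{11})}{p(l^k_o = 0)},\; \frac{c_{01}\, p^m_{o0}\, \xi^{ko}_{00}}{p(l^k_o = 0)} \right\}.
$$

If $l^k_o = 1$, the new class probability estimate for object $o$ is

$$
p^{m+1}_{o1} = \frac{p^m_{o1}\, \xi^{ko}_{11}}{p^m_{o0} (1 - \xi^{ko}_{00}) + p^m_{o1}\, \xi^{ko}_{11}},
$$

and the associated estimated misclassification cost is

$$
\mathrm{EstCost}^{m+1}_{o \mid l^k_o = 1} = \min\{ c_{10}\, p^{m+1}_{o1},\; c_{01}\, p^{m+1}_{o0} \}
= \min\left\{ \frac{c_{10}\, p^m_{o1}\, \xi^{ko}_{11}}{p(l^k_o = 1)},\; \frac{c_{01}\, p^m_{o0} (1 - \xi^{ko}_{00})}{p(l^k_o = 1)} \right\}.
$$

Therefore, the expected misclassification cost at step $m + 1$ is

$$
\begin{aligned}
E(\mathrm{EstCost}^{m+1}_o)
&= p(l^k_o = 0)\, \mathrm{EstCost}^{m+1}_{o \mid l^k_o = 0} + p(l^k_o = 1)\, \mathrm{EstCost}^{m+1}_{o \mid l^k_o = 1} \\
&= \min\{ c_{10}\, p^m_{o1} (1 - \xi^{ko}_{11}),\; c_{01}\, p^m_{o0}\, \xi^{ko}_{00} \}
 + \min\{ c_{10}\, p^m_{o1}\, \xi^{ko}_{11},\; c_{01}\, p^m_{o0} (1 - \xi^{ko}_{00}) \}.
\end{aligned}
$$

If the predicted label at step $m$ and step $m + 1$ is 0, then:

$$
E(\mathrm{EstCost}^{m+1}_o) = c_{10}\, p^m_{o1} (1 - \xi^{ko}_{11}) + c_{10}\, p^m_{o1}\, \xi^{ko}_{11} = c_{10}\, p^m_{o1} = \mathrm{EstCost}^m_o.
$$

If the predicted label at step $m$ and step $m + 1$ is 1, then:

$$
E(\mathrm{EstCost}^{m+1}_o) = c_{01}\, p^m_{o0}\, \xi^{ko}_{00} + c_{01}\, p^m_{o0} (1 - \xi^{ko}_{00}) = c_{01}\, p^m_{o0} = \mathrm{EstCost}^m_o.
$$

Therefore, $\mathrm{EstCost}^m_o = E(\mathrm{EstCost}^{m+1}_o)$.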
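The invariance argued in this proof can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, `xi00` and `xi11` stand for the worker's probabilities of labeling a class-0 and a class-1 object correctly, and the default costs assume a 0/1 loss with zero cost for correct classifications.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def est_cost(p0: float, p1: float, c01: float = 1.0, c10: float = 1.0) -> float:
    """Estimated cost: the cheaper of predicting 0 (risk c10*p1) or 1 (risk c01*p0)."""
    return min(c10 * p1, c01 * p0)


def expected_next_cost(p0: float, p1: float, xi00: float, xi11: float,
                       c01: float = 1.0, c10: float = 1.0) -> float:
    """Expected estimated cost after one more label from the worker."""
    total = 0.0
    # Likelihood of the worker's label under class 0 and class 1,
    # first for label 0, then for label 1.
    for like0, like1 in ((xi00, 1.0 - xi11), (1.0 - xi00, xi11)):
        prob_label = p0 * like0 + p1 * like1
        q0 = p0 * like0 / prob_label   # updated P(t = 0)
        q1 = p1 * like1 / prob_label   # updated P(t = 1)
        total += prob_label * est_cost(q0, q1, c01, c10)
    return total


# A weak worker cannot flip the prediction for a confident object,
# so the expected cost stays at its current value.
xi = sigmoid(1.0 * 0.5)
unchanged = expected_next_cost(0.9, 0.1, xi, xi)

# A strong worker can flip the prediction for an uncertain object,
# so the expected cost drops.
reduced = expected_next_cost(0.6, 0.4, 0.9, 0.9)
```

The second case shows why the proposition is conditional: once a single label can change the predicted class, the expected cost is strictly lower than the current cost.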

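C.2 Proof of Proposition 4

Proof. The estimated misclassification cost of object $o$ at step $m$ is

$$
\mathrm{EstCost}^m_o = \min_{j \in \{0,1\}} \sum_{i=0}^{1} p^m_{oi}\, c_{ij} = \min\{ c_{10}\, p^m_{o1},\; c_{01}\, p^m_{o0} \},
$$

where $p^m_{oi}$ is the class probability estimate at step $m$ and correct classifications carry zero cost. Worker $k$ assigns to object $o$ a label 0 with probability $p(l^k_o = 0) = p^m_{o0}\, e^k_{00} + p^m_{o1}\, e^k_{10}$ and a label 1 with probability $p(l^k_o = 1) = p^m_{o0}\, e^k_{01} + p^m_{o1}\, e^k_{11}$.

If $l^k_o = 0$, the new class probability estimate for object $o$ is

$$
p^{m+1}_{o0} = \frac{p^m_{o0}\, e^k_{00}}{p^m_{o0}\, e^k_{00} + p^m_{o1}\, e^k_{10}},
$$

and the associated estimated misclassification cost is

$$
\mathrm{EstCost}^{m+1}_{o \mid l^k_o = 0} = \min\{ c_{10}\, p^{m+1}_{o1},\; c_{01}\, p^{m+1}_{o0} \}
= \min\left\{ \frac{c_{10}\, p^m_{o1}\, e^k_{10}}{p^m_{o0}\, e^k_{00} + p^m_{o1}\, e^k_{10}},\; \frac{c_{01}\, p^m_{o0}\, e^k_{00}}{p^m_{o0}\, e^k_{00} + p^m_{o1}\, e^k_{10}} \right\}.
$$

If $l^k_o = 1$, the new class probability estimate for object $o$ is

$$
p^{m+1}_{o1} = \frac{p^m_{o1}\, e^k_{11}}{p^m_{o0}\, e^k_{01} + p^m_{o1}\, e^k_{11}},
$$

and the associated estimated misclassification cost is

$$
\mathrm{EstCost}^{m+1}_{o \mid l^k_o = 1} = \min\{ c_{10}\, p^{m+1}_{o1},\; c_{01}\, p^{m+1}_{o0} \}
= \min\left\{ \frac{c_{10}\, p^m_{o1}\, e^k_{11}}{p^m_{o0}\, e^k_{01} + p^m_{o1}\, e^k_{11}},\; \frac{c_{01}\, p^m_{o0}\, e^k_{01}}{p^m_{o0}\, e^k_{01} + p^m_{o1}\, e^k_{11}} \right\}.
$$

Therefore, the expected variation in misclassification cost is

$$
\begin{aligned}
E(\mathrm{EstCost}^m_o - \mathrm{EstCost}^{m+1}_o)
&= p(l^k_o = 0) \big( \mathrm{EstCost}^m_o - \mathrm{EstCost}^{m+1}_{o \mid l^k_o = 0} \big)
 + p(l^k_o = 1) \big( \mathrm{EstCost}^m_o - \mathrm{EstCost}^{m+1}_{o \mid l^k_o = 1} \big) \\
&= (p^m_{o0}\, e^k_{00} + p^m_{o1}\, e^k_{10}) \left[ \min\{ c_{10}\, p^m_{o1},\; c_{01}\, p^m_{o0} \}
 - \min\left\{ \frac{c_{10}\, p^m_{o1}\, e^k_{10}}{p^m_{o0}\, e^k_{00} + p^m_{o1}\, e^k_{10}},\; \frac{c_{01}\, p^m_{o0}\, e^k_{00}}{p^m_{o0}\, e^k_{00} + p^m_{o1}\, e^k_{10}} \right\} \right] \\
&\quad + (p^m_{o0}\, e^k_{01} + p^m_{o1}\, e^k_{11}) \left[ \min\{ c_{10}\, p^m_{o1},\; c_{01}\, p^m_{o0} \}
 - \min\left\{ \frac{c_{10}\, p^m_{o1}\, e^k_{11}}{p^m_{o0}\, e^k_{01} + p^m_{o1}\, e^k_{11}},\; \frac{c_{01}\, p^m_{o0}\, e^k_{01}}{p^m_{o0}\, e^k_{01} + p^m_{o1}\, e^k_{11}} \right\} \right].
\end{aligned}
$$

The value of the above function depends only on the class probability estimate $p^m_o$, the confusion matrix of the worker $e^k$, and the cost matrix.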
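The expected variation in misclassification cost can be computed directly from its three inputs. The following is a hedged sketch rather than the authors' code: the function name is hypothetical, `e[i][j]` is read as the probability that the worker assigns label j to an object of true class i, and `c[i][j]` as the cost of predicting j when the true class is i.

```python
from typing import Sequence


def expected_cost_reduction(
    p: Sequence[float],              # (P(t=0), P(t=1)) at the current step
    e: Sequence[Sequence[float]],    # worker confusion matrix, e[i][j]
    c: Sequence[Sequence[float]],    # cost matrix, c[i][j]
) -> float:
    """Expected drop in estimated misclassification cost from one more label."""

    def cost(q: Sequence[float]) -> float:
        # Cheapest prediction j under class probabilities q.
        return min(sum(q[i] * c[i][j] for i in (0, 1)) for j in (0, 1))

    current = cost(p)
    gain = 0.0
    for label in (0, 1):
        prob = p[0] * e[0][label] + p[1] * e[1][label]
        if prob == 0.0:
            continue  # this label can never be observed
        updated = (p[0] * e[0][label] / prob, p[1] * e[1][label] / prob)
        gain += prob * (current - cost(updated))
    return gain


zero_one = [[0.0, 1.0], [1.0, 0.0]]
good_worker = [[0.9, 0.1], [0.1, 0.9]]
random_worker = [[0.5, 0.5], [0.5, 0.5]]
gain_good = expected_cost_reduction((0.6, 0.4), good_worker, zero_one)
gain_random = expected_cost_reduction((0.6, 0.4), random_worker, zero_one)
```

A worker who labels at random leaves the class probabilities unchanged, so the expected reduction is zero, whereas an accurate worker yields a positive expected reduction on an uncertain object.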

Appendix D: Simulation Results: Inference Algorithms in a Static System

D.1 Object Actual Misclassification Cost

Since true classes are known, we can calculate the actual misclassification cost of each object under different inference algorithms based on Proposition 2. We report the average actual misclassification cost of objects as a function of the average number of labels assigned per object. The results in Figure D.1(a) are obtained under the symmetric cost matrix (a), and the results in Figure D.1(b) are obtained under the asymmetric cost matrix (b). We see that under both cost specifications, [the confusion-matrix and quality-vector algorithms] outperform [the accuracy-rate and worker-message algorithms] consistently. The performance gap becomes more pronounced when the cost matrix is asymmetric, which is not surprising since [the accuracy-rate and worker-message algorithms] focus only on prediction error rate, while [the confusion-matrix and quality-vector algorithms] take into account the costs associated with different types of classification errors when making predictions. It is worth noting that [the quality-vector algorithm] achieves performance similar to that of [the confusion-matrix algorithm] when the cost matrix is symmetric but possesses a clear advantage over it when the cost matrix is asymmetric. What causes the differential performance between the two when the cost matrix is asymmetric?

We turn to the basic assumption underlying [the confusion-matrix algorithm], that is, workers' error rates do not change when labeling objects of varying degrees of easiness. The consequence is that [the confusion-matrix algorithm] is likely to produce overconfident (or extreme) class probability estimates for difficult objects (see Appendix E for an explanation). The overconfident estimates may not change the label prediction in the symmetric cost setting but have an impact on the label prediction in the asymmetric cost setting. For instance, if the class probability estimates for an object are (0.8, 0.2) using [the quality-vector algorithm] and (0.9, 0.1) using [the confusion-matrix algorithm], then when the cost matrix is (a), both report 0; however, when the cost matrix is (b), [the quality-vector algorithm] reports 1 but [the confusion-matrix algorithm] reports 0. By incorporating object easiness into inference, [the quality-vector algorithm] allows the employer to obtain more accurate class probability estimates for each object, yielding considerable improvements in cost reduction when facing asymmetric misclassification costs.

D.2 Worker Quality Estimation Accuracy

Following the notation introduced in Section 4, the quality measures for worker $k$ under the four inference algorithms are, respectively, the accuracy rate $q^k$, the sum of worker messages $y^k = \sum_{o \in O^k} (2 l^k_o - 1)\, x_{o \to k}$, the confusion matrix $e^k$, and the quality vector $\hat{\alpha}^k$. Since these measures are all on different scales and hard to compare directly, we resort to Spearman's rank correlation coefficient, which provides a nonparametric estimate of the strength of association between two ranked variables. Table D.1 shows how we calculate the Spearman correlation for each inference algorithm, where $\rho_{X,Y}$ denotes Spearman's rho coefficient between $X$ and $Y$.

Algorithm | Quality Measure | Spearman Correlation
[accuracy-rate algorithm] | Accuracy rate $q^k$ | $0.5\,\rho_{\alpha^k_0, q^k} + 0.5\,\rho_{\alpha^k_1, q^k}$
[worker-message algorithm] | Sum of worker messages $y^k$ | $0.5\,\rho_{\alpha^k_0, y^k} + 0.5\,\rho_{\alpha^k_1, y^k}$
[confusion-matrix algorithm] | Confusion matrix $e^k$ | $0.5\,\rho_{\alpha^k_0, e^k_{00}} + 0.5\,\rho_{\alpha^k_1, e^k_{11}}$
[quality-vector algorithm] | Quality vector $\hat{\alpha}^k$ | $0.5\,\rho_{\alpha^k_0, \hat{\alpha}^k_0} + 0.5\,\rho_{\alpha^k_1, \hat{\alpha}^k_1}$

Table D.1 Calculating the Spearman correlation for different inference algorithms.
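The averaged Spearman correlation used to score worker quality estimates can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the function names are hypothetical, ties receive average ranks, and each list holds one value per worker (the true alpha_0 and alpha_1 values and the matching estimated quality components).

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def quality_correlation(alpha0, alpha1, est0, est1) -> float:
    """0.5 * rho(alpha_0, est_0) + 0.5 * rho(alpha_1, est_1) across workers."""
    return 0.5 * spearman(alpha0, est0) + 0.5 * spearman(alpha1, est1)
```

For a scalar quality measure such as an accuracy rate, the same list would be passed as both `est0` and `est1`; for a confusion matrix, the diagonal entries would be passed separately.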

The correlation results obtained using different inference algorithms are presented in Figure D.2(a). Contrary to our expectation, [the quality-vector algorithm] does not exhibit superior performance over [the confusion-matrix algorithm] in estimating worker quality. We attribute this to the uniform assignment of workers in the simulation. Under the uniform assignment, each worker is likely to be assigned a similar mixture of easy (large $\beta$) and difficult (small $\beta$) objects. Since worker quality is computed by aggregating over all the objects a worker has labeled, on average [the confusion-matrix algorithm] will not overestimate or underestimate the quality of workers.

We then turn to a nonuniform assignment setting in which some workers are disproportionately assigned more easy or more difficult objects. Specifically, we split the worker population into two halves: the first half is assigned 75% easy objects and 25% difficult objects, while the second half is assigned 25% easy objects and 75% difficult objects. The correlation results obtained under this nonuniform assignment are reported in Figure D.2(b), which demonstrates a slight advantage of [the quality-vector algorithm] over [the confusion-matrix algorithm] when the number of labels assigned to each object is relatively high. As more labels are collected, the easiness estimates of the objects using [the quality-vector algorithm] become more accurate, leading to a fairer evaluation of worker quality.

[Figure D.1: Average actual misclassification cost as a function of the average number of labels assigned per object for different inference algorithms in a static system. Panels: (a) symmetric cost matrix (a); (b) asymmetric cost matrix (b).]

[Figure D.2: Spearman correlation between worker quality estimates and true quality values as a function of the average number of labels assigned per object for different inference algorithms in a static system. Panels: (a) uniform assignment; (b) nonuniform assignment.]

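Appendix E: [The Confusion-Matrix Algorithm] Produces Overconfident Estimates for Difficult Objects

For illustration purposes, we consider a very simple case where all the workers have homogeneous labeling quality and the following relationship holds: $\alpha_0 = \alpha_1 > 0$. As the object receives more labels and worker quality estimates become more accurate, we will have $e_{00}, e_{11} > 0.5$ using [the confusion-matrix algorithm] and $\hat{\alpha}_0, \hat{\alpha}_1 > 0$ using [the quality-vector algorithm]. Under [the quality-vector algorithm], let us denote $\hat{\xi}_{ii} = 1 / (1 + e^{-\hat{\alpha}_i \hat{\beta}})$. Since $\hat{\alpha}_i > 0$, $\hat{\xi}_{ii}$ decreases as $\hat{\beta}$ gets smaller, i.e., as the object is more difficult. However, under [the confusion-matrix algorithm], $e_{ii}$ is the same across all the objects. When the object is sufficiently difficult, the relationship $\hat{\xi}_{ii} < e_{ii}$ holds.

Suppose that the object has collected $p$ positive labels and $n$ negative labels. In this symmetric case, write $e = e_{00} = e_{11}$ and $\hat{\xi} = \hat{\xi}_{00} = \hat{\xi}_{11}$, and take the two classes as equally likely a priori. For [the confusion-matrix algorithm], we have

$$
\Pr(t = 1) = \frac{e^{p} (1 - e)^{n}}{e^{p} (1 - e)^{n} + (1 - e)^{p} e^{n}} = \frac{1}{1 + \left( \frac{1 - e}{e} \right)^{p - n}},
\qquad
\Pr(t = 0) = \frac{(1 - e)^{p} e^{n}}{e^{p} (1 - e)^{n} + (1 - e)^{p} e^{n}} = \frac{1}{1 + \left( \frac{e}{1 - e} \right)^{p - n}}.
$$

For [the quality-vector algorithm], we have

$$
\Pr(t = 1) = \frac{\hat{\xi}^{p} (1 - \hat{\xi})^{n}}{\hat{\xi}^{p} (1 - \hat{\xi})^{n} + (1 - \hat{\xi})^{p} \hat{\xi}^{n}} = \frac{1}{1 + \left( \frac{1 - \hat{\xi}}{\hat{\xi}} \right)^{p - n}},
\qquad
\Pr(t = 0) = \frac{(1 - \hat{\xi})^{p} \hat{\xi}^{n}}{\hat{\xi}^{p} (1 - \hat{\xi})^{n} + (1 - \hat{\xi})^{p} \hat{\xi}^{n}} = \frac{1}{1 + \left( \frac{\hat{\xi}}{1 - \hat{\xi}} \right)^{p - n}}.
$$

Without loss of generality, we assume that $p > n$. Then the probability estimate of the most likely class is $1 / \big( 1 + ((1 - e)/e)^{p - n} \big)$ under [the confusion-matrix algorithm] and $1 / \big( 1 + ((1 - \hat{\xi})/\hat{\xi})^{p - n} \big)$ under [the quality-vector algorithm]. Since $\hat{\xi} < e$ holds when the object is sufficiently difficult, we have

$$
\frac{1}{1 + \left( \frac{1 - e}{e} \right)^{p - n}} > \frac{1}{1 + \left( \frac{1 - \hat{\xi}}{\hat{\xi}} \right)^{p - n}}.
$$

Therefore, [the confusion-matrix algorithm] produces more confident class probability estimates for the object.

To confirm that this is indeed the case, we plot the probability estimates of the most likely classes, i.e., $\max\{\Pr(t = 0), \Pr(t = 1)\}$, by the two algorithms for the most difficult objects in Figure E.1, which clearly shows that the estimates of [the confusion-matrix algorithm] are much more extreme (i.e., close to 1) than those of [the quality-vector algorithm]. (The results are obtained under the simulation setting in Section 6.)

[Figure E.1: The probability estimates of the most likely classes for the most difficult objects. Vertical axis: probability estimate of the most likely class.]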
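A small numeric sketch makes the overconfidence effect concrete and shows how it can change a prediction under asymmetric costs. Everything here is illustrative: the per-label correctness probabilities (0.8 for a model that ignores difficulty, 0.6 for one that accounts for a difficult object), the label counts, and the asymmetric cost matrix are hypothetical values, and the formula assumes homogeneous workers and equal class priors.

```python
from typing import Sequence


def most_likely_class_prob(correct_prob: float, p: int, n: int) -> float:
    """Posterior of the majority class after p positive and n negative labels (p > n).

    Implements 1 / (1 + ((1 - x) / x) ** (p - n)), where x is the per-label
    probability of a correct answer.
    """
    return 1.0 / (1.0 + ((1.0 - correct_prob) / correct_prob) ** (p - n))


def predict(probs: Sequence[float], cost: Sequence[Sequence[float]]) -> int:
    """Label j that minimizes the expected cost sum_i probs[i] * cost[i][j]."""
    expected = [sum(probs[i] * cost[i][j] for i in (0, 1)) for j in (0, 1)]
    return 0 if expected[0] <= expected[1] else 1


# Hypothetical numbers: 4 positive and 2 negative labels on a difficult object.
fixed_error = most_likely_class_prob(0.8, p=4, n=2)        # ignores difficulty
difficulty_aware = most_likely_class_prob(0.6, p=4, n=2)   # accounts for difficulty

symmetric = [[0.0, 1.0], [1.0, 0.0]]
asymmetric = [[0.0, 5.0], [1.0, 0.0]]  # hypothetical: labeling a true 0 as 1 costs 5

pred_fixed = predict((1.0 - fixed_error, fixed_error), asymmetric)
pred_aware = predict((1.0 - difficulty_aware, difficulty_aware), asymmetric)
```

With these numbers both estimates favor class 1 under the symmetric costs, but under the hypothetical asymmetric costs only the more extreme estimate still predicts class 1, which is the mechanism behind the performance gap discussed for asymmetric cost matrices.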

Appendix F: Supplementary Figures

[Figure F.1: Average actual misclassification cost for different inference algorithms on real-world datasets. Panels: (a) bluebird dataset, symmetric cost (a); (b) bluebird dataset, asymmetric cost (b); (c) rte dataset, symmetric cost (a); (d) rte dataset, asymmetric cost (b); (e) temp dataset, symmetric cost (a); (f) temp dataset, asymmetric cost (b).]

[Figure F.2: The interface of bonus payment to workers. The plot marks the worker's current position ("You are here!") and the bonus payment against the accuracy in classifying positive objects and the accuracy in classifying negative objects.]

References

Bernstein MS, Little G, Miller RC, Hartmann B, Ackerman MS, Karger DR, Crowell D, Panovich K (2010) Soylent: A word processor with a crowd inside. Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, 313-322.

Kittur A, Smus B, Khamkar S, Kraut RE (2011) CrowdForge: Crowdsourcing complex work. Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, 43-52.

Kulkarni AP, Can M, Hartmann B (2011) Turkomatic: Automatic recursive task and workflow design for Mechanical Turk. Proceedings of the 2011 Annual Conference Extended Abstracts on Human Factors in Computing Systems, 2053-2058.

Little G, Chilton LB, Goldman M, Miller RC (2010) TurKit: Human computation algorithms on Mechanical Turk. Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, 57-66.

Whitehill J, Ruvolo P, Wu T, Bergsma J, Movellan J (2009) Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. Advances in Neural Information Processing Systems, 2035-2043.