Filtering with the Crowd



1 Filtering with the Crowd
Benoît Groz, Ezra Levin, Isaac Meilijson, Tova Milo
LRI, Univ. Paris Saclay; Tel-Aviv University
15 March

2 Outline
1. The CrowdScreen framework
2. Algorithms for computing good/optimal strategies
3. Experimental results

3 Broader perspective: CrowdSourcing project at TAU
CrowdSourcing: engaging web users to contribute and process information.
Use cases: Wikipedia, annotations for ML data (images, video, text processing), reviews, data cleaning...
FP7 MODAS project (T. Milo et al.): develop foundations for the management of large-scale crowd-sourced data.
- Interface (NLP, query refining)
- Mining with the crowd (ontologies, association rules, data cleaning)...
- Query optimization (planning queries, filtering, skyline)

4 Filtering with the Crowd
Example: s = 5% gluten-free cereals, e0 = e1 = % errors.
Filtering in CrowdScreen's model: select, with the minimum number of tasks, all cereals with gluten, with τ = 1% of misclassification. Answers are noisy, so tasks must be redundant.
Compute a sequential test in terms of e0, e1, s, τ and budget m.
Aim: minimize cost = #tasks.
Strong assumption: s, e0, e1 are known in advance (sampling may be used).

5 Strategies
Compute a strategy when:
- 5% of cereals contain gluten;
- error probability . per answer;
- we wish at most 1% error;
- we can afford at most m (say, 51) questions per cereal.
Figure: strategy grids (axes: # No, # Yes). Legend: continuing point, accepting point, rejecting point, unreachable point.
Minimize the expected cost for a given error threshold and budget.
Here: error rates e0, e1 = ., selectivity s = .5, error threshold τ = .1, budget m = 51. In general s ≠ .5 and e0 ≠ e1, but the strategies have a similar shape.
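To make "expected cost" and "error" of a strategy concrete, here is a minimal Python sketch that evaluates a deterministic strategy on the grid by propagating reach probabilities. It is an illustration under stated assumptions, not the authors' implementation: the names `evaluate`, `decide`, `CONTINUE`, `ACCEPT` and `REJECT` are hypothetical, x counts No answers, y counts Yes answers, e0 is taken to be the probability of a Yes on an item that fails the filter, and e1 the probability of a No on an item that passes it.

```python
CONTINUE, ACCEPT, REJECT = "continue", "accept", "reject"

def evaluate(decide, s, e0, e1, m):
    """Return (expected_cost, error_probability) of a deterministic strategy.

    decide(x, y) labels grid point (x, y): x = number of No answers,
    y = number of Yes answers. Every reachable point with x + y == m must stop.
    """
    # reach1[(x, y)]: P(item passes the filter AND the walk reaches (x, y))
    # reach0[(x, y)]: P(item fails the filter AND the walk reaches (x, y))
    reach1 = {(0, 0): s}
    reach0 = {(0, 0): 1.0 - s}
    cost = 0.0
    error = 0.0
    for n in range(m + 1):            # n = questions asked so far
        for x in range(n + 1):
            y = n - x
            p1 = reach1.get((x, y), 0.0)
            p0 = reach0.get((x, y), 0.0)
            if p1 + p0 == 0.0:
                continue              # unreachable point
            action = decide(x, y)
            if action == CONTINUE:
                if n == m:
                    raise ValueError("strategy must stop once the budget is spent")
                # One more answer: a No moves right, a Yes moves up.
                for nxt, q1, q0 in (((x + 1, y), e1, 1.0 - e0),
                                    ((x, y + 1), 1.0 - e1, e0)):
                    reach1[nxt] = reach1.get(nxt, 0.0) + p1 * q1
                    reach0[nxt] = reach0.get(nxt, 0.0) + p0 * q0
            else:
                cost += n * (p1 + p0)
                # Accepting a failing item or rejecting a passing one is an error.
                error += p0 if action == ACCEPT else p1
    return cost, error

# Usage: always ask 3 questions, then follow the majority.
def majority_of_3(x, y):
    if x + y < 3:
        return CONTINUE
    return ACCEPT if y > x else REJECT

cost, err = evaluate(majority_of_3, s=0.5, e0=0.2, e1=0.2, m=3)
```

The evaluation visits each of the O(m^2) grid points once, so any candidate strategy can be scored cheaply; the hard part discussed in the following slides is searching the space of strategies.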

6 Seems a hard problem...
Problem: computing an optimal strategy (i.e., an optimal stopping time in a sequential test).
Complexity bounds:
- Check all possible strategies: O( m )
- Check all ladder strategies: O( m )
Figure: ladder strategy (accept region, reject region).
Heuristics. Probabilistic relaxation.

7 Contributions: outline
CrowdScreen [Parameswaran et al., SIGMOD 2012]:
- Defined the framework
- A linear program to compute the optimal probabilistic strategy
- Two gradient-based heuristics in O(m^5): shrink and growth
Our contributions:
- Analyze the complexity and show that the algorithms scale poorly
- Improve the complexity of both growth and shrink to O(m ), and remedy a flaw in growth
- Propose a scalable heuristic based on the well-known SPRT
- Establish connections between probabilistic and deterministic strategies

8 Outline
1. The CrowdScreen framework
2. Algorithms for computing good/optimal strategies
3. Experimental results

9 Panel of algorithms
Figure: Strategies returned for e0 = .5, e1 = ., s = ., τ = .75, and m = 15. Panels: (a) truncated sprt, (b) adaptsprt, (c) ladder, (d) shrink, (e) linear. Legend: continuing point, accepting point, rejecting point, P stop = ., unreachable point, decision line, SPRT lines.
SPRT [Wald]: stops when error < τ, that is, when the likelihood ratio leaves the interval:
\[ \mathrm{LR}(x, y) \notin \left[ \frac{\tau}{1-\tau}, \frac{1-\tau}{\tau} \right] \]
\[ \mathrm{LR}(x, y) = \frac{\Pr(\mathrm{reach}(x, y) \mid \text{gluten})}{\Pr(\mathrm{reach}(x, y) \mid \text{gluten-free})} \cdot \frac{s}{1-s} \]
\[ \log \mathrm{LR}(x, y) = \log\frac{s}{1-s} + x \log\frac{e_1}{1-e_0} + y \log\frac{1-e_1}{e_0} \]
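The SPRT stopping rule on this slide can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `log_lr` and `sprt_decide` are hypothetical, and it assumes x counts No answers, y counts Yes answers, and 0 < s, e0, e1, τ < 1.

```python
import math

def log_lr(x, y, s, e0, e1):
    """Log of the prior-weighted likelihood ratio at grid point (x, y)."""
    return (math.log(s / (1.0 - s))
            + x * math.log(e1 / (1.0 - e0))      # each No answer
            + y * math.log((1.0 - e1) / e0))     # each Yes answer

def sprt_decide(x, y, s, e0, e1, tau):
    """Return 'accept', 'reject' or 'continue' for the untruncated SPRT."""
    upper = math.log((1.0 - tau) / tau)          # log of (1 - tau) / tau
    lower = -upper                               # log of tau / (1 - tau)
    value = log_lr(x, y, s, e0, e1)
    if value > upper:
        return "accept"      # posterior probability of a wrong accept is below tau
    if value < lower:
        return "reject"      # posterior probability of a wrong reject is below tau
    return "continue"

# Usage: with s = 0.5, e0 = e1 = 0.2 and tau = 0.1, one Yes is not enough
# evidence, while two Yes answers with no No answer are.
first = sprt_decide(0, 1, 0.5, 0.2, 0.2, 0.1)
second = sprt_decide(0, 2, 0.5, 0.2, 0.2, 0.1)
```

Because the log-ratio is linear in x and y, the two thresholds are parallel straight lines on the grid, which matches the "SPRT lines" entry in the figure legend.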

10 Panel of algorithms
Figure: same panels and parameters as on slide 9.
Truncated SPRT: stops when error < τ or when the budget m is exhausted. Truncation may raise the error above τ. Using binary search, we can compute the optimal LR threshold to obtain expected error < τ.
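The calibration idea on this slide can be sketched as follows. This is a rough Python illustration under several assumptions, not the authors' adaptsprt code: all names are hypothetical, the threshold is expressed as a posterior error level t (equivalent to the LR interval [t/(1−t), (1−t)/t]), a forced decision at budget m follows the more likely class, and the search assumes the overall error does not decrease when t grows. Since a value is kept only after it has been checked, the returned t is always feasible even where that monotonicity fails.

```python
def posterior(x, y, s, e0, e1):
    """P(item passes the filter | x No answers, y Yes answers)."""
    a = s * (e1 ** x) * ((1.0 - e1) ** y)
    b = (1.0 - s) * ((1.0 - e0) ** x) * (e0 ** y)
    return a / (a + b)

def truncated_sprt(t, s, e0, e1, m):
    """Stop once the posterior error drops below t, or when the budget m is spent."""
    def decide(x, y):
        p = posterior(x, y, s, e0, e1)
        if 1.0 - p < t:
            return "accept"
        if p < t:
            return "reject"
        if x + y == m:                       # truncation: forced decision
            return "accept" if p >= 0.5 else "reject"
        return "continue"
    return decide

def evaluate(decide, s, e0, e1, m):
    """Expected cost and error of a strategy, by propagating reach probabilities."""
    reach1, reach0 = {(0, 0): s}, {(0, 0): 1.0 - s}
    cost = error = 0.0
    for n in range(m + 1):
        for x in range(n + 1):
            y = n - x
            p1, p0 = reach1.get((x, y), 0.0), reach0.get((x, y), 0.0)
            if p1 + p0 == 0.0:
                continue                     # unreachable point
            action = decide(x, y)
            if action == "continue":
                for nxt, q1, q0 in (((x + 1, y), e1, 1.0 - e0),
                                    ((x, y + 1), 1.0 - e1, e0)):
                    reach1[nxt] = reach1.get(nxt, 0.0) + p1 * q1
                    reach0[nxt] = reach0.get(nxt, 0.0) + p0 * q0
            else:
                cost += n * (p1 + p0)
                error += p0 if action == "accept" else p1
    return cost, error

def calibrate(s, e0, e1, m, tau, iters=40):
    """Binary-search the largest t <= tau whose truncated SPRT has error <= tau.

    Returns None when no tested threshold meets the error bound.
    """
    if evaluate(truncated_sprt(tau, s, e0, e1, m), s, e0, e1, m)[1] <= tau:
        return tau                           # truncation did not hurt
    lo, hi, best = 0.0, tau, None
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if evaluate(truncated_sprt(mid, s, e0, e1, m), s, e0, e1, m)[1] <= tau:
            best, lo = mid, mid              # feasible: try a looser threshold
        else:
            hi = mid                         # infeasible: tighten
    return best

# Usage: with budget 5, tau = 0.07 cannot be used directly as the threshold,
# so the search settles on a stricter one.
t = calibrate(s=0.5, e0=0.2, e1=0.2, m=5, tau=0.07)
```

A looser threshold stops earlier and therefore costs less, which is why the sketch looks for the largest feasible t rather than any feasible one.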

11 Panel of algorithms
Figure: same panels and parameters as on slide 9.
Ladder: the number of ladder strategies is O( m /m) (a little fewer, but Ω( m /m^3)). Error and cost are computed in amortized O(m), using incremental evaluation/enumeration [Knuth].

12 Panel of algorithms
Figure: same panels and parameters as on slide 9.
shrink: gradient-like heuristic. For each point (x, y), compute the ratio ΔC/ΔE of the change in cost to the change in error obtained by making (x, y) a terminating point. Add a terminating point where this ratio is maximal, while keeping the error E within τ.
Proposition: We can compute all ratios ΔC/ΔE(x, y) in O(m ).

13 Panel of algorithms
Figure: same panels and parameters as on slide 9.
Probabilistic strategy: lower cost, and a linear program computes the optimal strategy.
Theorem: There is an optimal strategy with a single probabilistic point (unique in general).
Instead of the linear program, we can use shrink with a probabilistic point.

14 Outline
1. The CrowdScreen framework
2. Algorithms for computing good/optimal strategies
3. Experimental results

15 Panel of algorithms
Figure: same panels and parameters as on slide 9.
Table: cost and error of SPRT, AdaptSprt, ladder, shrink and linear (values not recoverable).

16 Experiments on real Crowd
Figure: Question parameters (columns: question, s, e0, e1; most values not recoverable).
Q1 photos from Australia
Q2 photos from Greece or Cyprus ..7.3
Q3 dishes containing dairy
Q4 dishes containing onions
Q5 dishes containing garlic
...
Q dishes containing eggs
Figure: Average cost per item (with m = 1, τ = .1) for Q1, Q2, Q3, Q4. Series: linear, ladder, shrink, rect, AdaptSprt, expected value.

17 Running time
Figure: Average running time (s) of algorithms on random instances, as a function of budget m. Series: ladder, exact, xcheck, gurobi, shrink, shrinkhp, adaptsprt.
Figure: Running time for ladder and shrink variants.
(e0, e1, τ, m, s) | ladder (PyPy) | naive (PyPy) | naive (CPython)
(.5, ., .75, 15, .) | .s | .s | 3min
(., .5, .5, 1, .) | 5s | .5min | .7h
(e0, e1, τ, m, s) | shrink (PyPy) | naive (PyPy) | naive (CPython)
(.5, ., .75, , .) | .s | .s | 3s
(.5, ., .75, , .) | 1.5s | 1 min | > 5h

18 Figure: For e0 = ., e1 = .5, τ = .5, s = .: cost, and sensitivity (only shrink, m = 15).
Cost panel: expected cost as a function of budget m. Series: ladder, xcheckdet, gurobi, shrink, adaptsprt.
Sensitivity panel: Sensitivity of the strategy (τ = .1, m = 1). Axes: strategy parameters e0, e1; cost and error as a percentage of the original one.

19 Conclusion
CrowdScreen's purpose: classify multiple items according to a predefined strategy.
Problem: compute "optimal" stopping rules given the parameters.
Our contributions:
- optimize previous algorithms
- explain or fix properties observed in the original framework
- establish a connection between shrink and the probabilistic strategy
- experimental evaluation
