Models of collective inference

Size: px

Start display at page:

Download "Models of collective inference"

Paulina Jennings
5 years ago
Views:

1 Models of collective inference Laurent Massoulié (Microsoft Research-Inria Joint Centre) Mesrob I. Ohannessian (University of California, San Diego) Alexandre Proutière (KTH Royal Institute of Technology) Kuang Xu (Stanford)

2 Outline Spreading news to interested users How to jointly achieve news -categorization by users, and -dissemination to interested users [LM, M.Ohanessian, A. Proutière, Sigmetrics 15] Information-processing systems Treatment of inference tasks by pool of agents with limited capacities (ex: crowdsourcing) [LM, K. Xu 15]

recommended news) Goal: sequentially select nodes UU 1,, UU II up to stopping time

3 A sequential decision problem Unknown latent variable θθ TT (news topic) Nodes uu UU, if selected, generate feedback YY uu ~BBBBBB pp uuuu (appreciation of recommended news) Goal: sequentially select nodes UU 1,, UU II up to stopping time II to minimize where for (missed opportunities) Assumption: partially known preferences (spam)

4 Related Work Multi-Armed Bandits (MAB): Contextual [Li et al. 2010] Infinitely many arms [Berry et al. 1997] Secretary problem [Ferguson 1989] Best arm identification [Kaufmann et al. 2014] Distinguishing features: Partial information on parameters Finitely many, exhaustible arms Stopping time, trades off two regrets Find all good arms, not the best

5 Pseudo-Bayesian Updates Pseudo-preferences: sensitivity Initial prior: Pseudo-posterior update:

6 Greedy-Bayes Algorithm Initialize pseudo-preferences and prior At each step i Make greedy selection Based on response, update pseudo-posterior Stop if threshold

7 How good is Greedy-Bayes? Separation: Main Theorem: If and, Then, under Greedy-Bayes: Moreover can be such that for any algorithm (uniformly at least as good as Greedy-Bayes):

8 The tool: Bernstein s inequality for martingales [Freedman 75] Martingale MM ii ii 1 with compensator MM ii and increments bounded by rr satisfies for any stopping time II

9 Proof sketch bound on spamming regret CC 1 Goal: Key quantity:

10 Proof sketch bound on spamming regret CC 1 Martingale analysis of RR ii : RR ii = AA ii + MM ii where MM ii martingale with increments bounded by O δδ, and for some ZZ ii ZZ ii : Hence: then apply Freedman s inequality

Lower regret with extra structure Topic that

for any algorithm (uniformly at least as good as

11 Lower regret with extra structure Topic that many like who dislike other topics (*) Theorem: Under (*), if Greedy-Bayes has regret then for topic θθ Constant in U Moreover can be such that for any algorithm (uniformly at least as good as Greedy-Bayes) Other conditions give log T instead of T

12 Thompson Greedy-Bayes sampling Algorithm Algorithm Given matrix : Initialize pseudo-preferences and prior. At each step i Make a greedy selection: Observe the response Update the pseudo-posterior: Stop if:

Example Groups of users, each prefers 1 topic

Groups of users, each prefers ½ of all topics

13 Example Groups of users, each prefers 1 topic An additional topic is pure spam When truth the truth is not is spam, exploration growth is linear has to with continue T Easier Case Groups of users, each prefers ½ of all topics Topic identified by bisection, regret of order log T

14 Summary Push-based news dissemination Extension of multi-armed bandit ideas Greedy-Bayes algorithm Simple yet provably order-optimal Robust, uses only partial preference information Adaptive to structure Implications Benchmark for push-pull decentralization Importance of leveraging negative feedback

15 Open questions Is there a simple characterization in terms of of the optimal regret? Is Greedy-Bayes (or Thompson Sampling) order-optimal under more general conditions? partial answers: Toy examples where greed fails to probe highly informative nodes with high expected contribution to regret Hope for somewhat general conditions under which Greedy Bayes performs near-bisection

16 Capacity of Information Processing Systems that perform inference/categorization tasks, based on queries to resource-limited agents Systems Crowd sourcing Human experts Machine learning (e.g., clustering) Cloud clusters Medical lab diagnostics Specialized lab machines Output based on large number of noisy inspections Goal: identify efficient processing which uses minimum amount of resources

17 Problem Formulation Jobs arrive to system as unit rate Poisson process Job i has hidden label H i in finite set H H i generated i.i.d from distribution, π H i ~ π Waiting Room H i = cat (hidden)

18 Problem Formulation Goal: output estimated label, H i, s.t. H i ~ π Waiting Room H i = cat (hidden) if H i if H i = cat = dog

19 System resources m experts of which r k fraction is type-k Type-k expert when inspecting type h- job produces random response X: for distributions { f(h,k,*) } assumed known Single job inspection occupies an expert for time Exp(1) Based on inspection history (x 1, k 1 ), (x 2,k 2 ),. (x n,k n ), Could tag job with ML estimator

20 Learning System H i ~ π Waiting Room H i = cat (hidden) H i (x,1) K K m*r 2

21 Inspection Policy An inspection policy, ϕ, decides 1. Which experts to inspect which jobs - based on past history 2. When a job will be allowed to depart from the system

22 Capacity Efficiency m ϕ (π,δ) = smallest number of experts needed for stability under policy ϕ job type distribution π Estimation error δ Let m*(π,δ)= min ϕ m ϕ (π,δ) Definition: ϕ is capacity efficient, if for all π

23 Main Result Theorem 1. There exists a capacity efficient policy ϕ, s.t. 2. The policy ϕ does not require knowledge of π

24 Main Result (cont.) Makes no restrictions on response distributions {f(h,k,*)} Distribution π may be unknown and change over time advantageous to have Upper bound independent of π π-oblivious inspection policy Constant c is explicit (but messy), and is proportional to 1/ (min k r k ) total number of labels Ratio between KL-divergences among distributions of inspection results

25 Related Work Classical sequential hypothesis testing Single expert by Wald 1945, Multi-expert model by Chernoff 1959 Costs measured by sample complexity Does not involve finite resource constraint or simultaneous processing of multiple jobs Max-Weight policy in network resource allocation Policy contains a variant of Max-Weight policy as a submodule Need to work with approximate labels and resolve synchronicity issues

26 Proof Outline Main Steps: 1. Translate estimation accuracy into service requirements 2. Design processing architecture to balance contention for services among different job labels Technical ingredients: 1. ``Min-weight adaptive policy using noisy job labels 2. Fluid model to prove stability over multiple stages

27 From Statistics to Services Consider single job i Suppose i leaves system upon receiving N inspections, with history Write Define likelihood ratio:

28 From Statistics to Services Lemma (Sufficiency of a Likelihood Ratio Test) Define event Then Lemma (Necessity) If decision rule yields error < δ then it must be that

29 From Statistics to Services For true job type H i =h, service requirement of job i: For all ll h need to inspect job i until S(h,l) reaches ln(1/δ) Job i arrives with vector of workloads with initial level = ln(1/δ) for all coordinates ll h Job of true type h when inspected by type-k expert receives ``service Expected service amount: KL-divergence

30 Lower bound on performance If we knew true labels of all jobs and label distribution π, we would need m as large as solution mm(ππ, δδ) of LP: Where n h,k : Number of inspections of h-labeled job by type-k expert Solution of LP, mm ππ, δδ = C ln( 1 δδ ) : lower bound on optimal capacity requirement

31 Processing Architecture Preparation stage: crude estimates of types from randomized inspections Adaptive stage: use crude estimates to ``boot-strap adaptive allocation policy for bulk of inspections and refinement of type estimates Generate approximately optimal n h,k as in LP Residual stage: fix poorly learned labels from Adaptive Stage with randomized inspections

32 System size requirements Preparation stage Short: inspections produces rough estimates Adaptive stage Long: produces good estimates with high probability Residual stage Long produces good estimates only invoked with small probability

33 System size requirements Total overhead compared to m*

34 Conclusions Considered resource allocation issues for learning systems Leads to unusual combination of sequential hypothesis testing with network resource scheduling Outlook Simpler solutions with better performance? Impact of number of types, and structure of expertise?

New Algorithms for Contextual Bandits

New Algorithms for Contextual Bandits Lev Reyzin Georgia Institute of Technology Work done at Yahoo! 1 S A. Beygelzimer, J. Langford, L. Li, L. Reyzin, R.E. Schapire Contextual Bandit Algorithms with Supervised