Appendix


Xi Chen, Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Qihang Lin, Tepper School of Business, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Dengyong Zhou, Machine Learning Group, Microsoft Research, Redmond, WA 98052, USA

1 Proof of Corollary 1

Recall that $I(a,b) \equiv \Pr\big(\theta \geq 1/2 \mid \theta \sim \mathrm{Beta}(a,b)\big) = \frac{1}{B(a,b)}\int_{1/2}^{1} t^{a-1}(1-t)^{b-1}\,dt$, where $B(a,b) = \int_0^1 t^{a-1}(1-t)^{b-1}\,dt$ is the beta function. It is easy to see that $I(a+1,b) > I(a,b) > I(a,b+1)$. We re-write $I(a,b)$ as follows:
$$I(a,b) = 1 - \frac{1}{B(a,b)}\int_0^{1/2} t^{a-1}(1-t)^{b-1}\,dt = 1 - \frac{1}{B(a,b)}\int_{1/2}^{1} t^{b-1}(1-t)^{a-1}\,dt,$$
where the second equality is obtained by setting $t := 1-t$. Then we have:
$$I(a,b) - \big(1 - I(a,b)\big) = \frac{1}{B(a,b)}\int_{1/2}^{1}\Big(t^{a-1}(1-t)^{b-1} - t^{b-1}(1-t)^{a-1}\Big)\,dt = \frac{1}{B(a,b)}\int_{1/2}^{1}\left(\Big(\frac{t}{1-t}\Big)^{a-b} - 1\right)t^{b-1}(1-t)^{a-1}\,dt.$$
When $a > b$: since $t > 1/2$, we have $\frac{t}{1-t} > 1$, hence $\big(\frac{t}{1-t}\big)^{a-b} > 1$ and $I(a,b) - (1-I(a,b)) > 0$, i.e., $I(a,b) > 1/2$. When $a = b$: $\big(\frac{t}{1-t}\big)^{a-b} = 1$ and $I(a,b) = 1/2$. When $a < b$: $\big(\frac{t}{1-t}\big)^{a-b} < 1$ and $I(a,b) < 1/2$.

2 Proof of Proposition 3

We use the proof technique in Xie & Frazier to prove Proposition 3. By Proposition 1, we first obtain the value function:
$$V(S^0) = \sup_{\pi}\mathbb{E}\Big[\mathbb{E}\big(|H_T\cap H^*| + |H_T^c\cap (H^*)^c| \,\big|\, F_T\big)\Big] = \sup_{\pi}\mathbb{E}\Big[\sum_{i=1}^{K} h(P_i^T)\Big],$$
where $h(x) = \max(x, 1-x)$. To decompose the final accuracy $\sum_{i=1}^K h(P_i^T)$ into intermediate rewards at each stage, we define $G_0 = \sum_{i=1}^K h(P_i^0)$ and $G_{t+1} = \sum_{i=1}^K h(P_i^{t+1}) - \sum_{i=1}^K h(P_i^t)$. Then $\sum_{i=1}^K h(P_i^T)$ can be decomposed as:
$$\sum_{i=1}^K h(P_i^T) = G_0 + \sum_{t=0}^{T-1} G_{t+1}.$$
The value function can now be re-written as follows:
$$V(S^0) = G_0(S^0) + \sup_\pi\mathbb{E}\Big[\sum_{t=0}^{T-1} G_{t+1}\Big] = G_0(S^0) + \sup_\pi\mathbb{E}\Big[\sum_{t=0}^{T-1}\mathbb{E}\big(G_{t+1}\mid F_t\big)\Big] = G_0(S^0) + \sup_\pi\mathbb{E}\Big[\sum_{t=0}^{T-1}\mathbb{E}\big(G_{t+1}\mid S^t, i_t\big)\Big].$$
Here, the first equality is true because $G_0$ is deterministic and independent of $\pi$, the second equality is due to the tower property of conditional expectation, and the third equality holds because $P_i^{t+1}$ and $P_i^t$, and thus $G_{t+1}$, depend on $F_t$ only through $S^t$ and $i_t$. We define the intermediate expected reward gained by labeling the $i_t$-th instance at the state $S^t$ as follows:
$$R(S^t, i_t) \equiv \mathbb{E}\big(G_{t+1}\mid S^t, i_t\big) = \mathbb{E}\Big(\sum_{i=1}^K h(P_i^{t+1}) - \sum_{i=1}^K h(P_i^{t}) \,\Big|\, S^t, i_t\Big) = \mathbb{E}\Big(h(P_{i_t}^{t+1}) - h(P_{i_t}^{t}) \,\Big|\, S^t, i_t\Big).$$
The last equation is due to the fact that only $P_{i_t}^t$ will change if the $i_t$-th instance is labeled next. With the expected reward function in place, our value function takes the following form:
$$V(S^0) = G_0(S^0) + \sup_\pi\mathbb{E}\Big[\sum_{t=0}^{T-1} R(S^t, i_t) \,\Big|\, S^0\Big].$$
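Both the corollary and the reward just defined are expressed through $I(a,b)$, which is one minus the regularized incomplete beta function evaluated at $1/2$. The following small sketch (an illustration, not part of the original appendix; it assumes SciPy is available) evaluates $I(a,b)$ and checks the trichotomy of Corollary 1 numerically.

```python
# Numerical check of Corollary 1: I(a, b) = Pr(theta >= 1/2) for theta ~ Beta(a, b)
# equals 1 - betainc(a, b, 1/2), where betainc is the regularized incomplete beta.
from scipy.special import betainc

def I(a, b):
    """Pr(theta >= 1/2) under Beta(a, b)."""
    return 1.0 - betainc(a, b, 0.5)

for a, b in [(3, 2), (2, 3), (4, 4), (7, 1)]:
    val = I(a, b)
    if a > b:
        assert val > 0.5
    elif a < b:
        assert val < 0.5
    else:
        assert abs(val - 0.5) < 1e-12
    print(f"I({a},{b}) = {val:.4f}")
```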

3 Proof of Proposition 3

To prove the failure of deterministic KG, we first establish a key property of the expected reward function
$$R(a,b) = \frac{a}{a+b}\,h\big(I(a+1,b)\big) + \frac{b}{a+b}\,h\big(I(a,b+1)\big) - h\big(I(a,b)\big).$$

Lemma 3. When $a, b$ are positive integers, if $a = b$ then $R(a,b) = \frac{1}{a\,2^{2a}\,B(a,a)}$, and if $a \neq b$ then $R(a,b) = 0$.

To prove Lemma 3, we first present several basic properties of $B(a,b)$ and $I(a,b)$, which will be used in the following theorems and proofs.

Properties of $B(a,b)$:
$$B(a,b) = B(b,a), \qquad B(a+1,b) = \frac{a}{a+b}\,B(a,b), \qquad B(a,b+1) = \frac{b}{a+b}\,B(a,b).$$
Properties of $I(a,b)$:
$$I(a,b) = 1 - I(b,a), \qquad I(a+1,b) = I(a,b) + \frac{1}{2^{a+b}\,a\,B(a,b)}, \qquad I(a,b+1) = I(a,b) - \frac{1}{2^{a+b}\,b\,B(a,b)}.$$
The properties of $I(a,b)$ are derived from basic properties of the regularized incomplete beta function.

Proof. When $a = b$, by Corollary 1 we have $I(a+1,b) > 1/2$, $I(a,b) = 1/2$ and $I(a,b+1) < 1/2$. Therefore, the expected reward takes the following form:
$$R(a,a) = \frac{1}{2}\,I(a+1,a) + \frac{1}{2}\big(1 - I(a,a+1)\big) - I(a,a) = I(a+1,a) - I(a,a) = \frac{1}{a\,2^{2a}\,B(a,a)},$$
where the second equality uses the symmetry $1 - I(a,a+1) = I(a+1,a)$ and the last uses the recurrence for $I(a+1,a)$.

When $a > b$, since $a, b$ are integers, we have $a \geq b+1$ and hence $I(a+1,b) > 1/2$, $I(a,b) > 1/2$ and $I(a,b+1) \geq 1/2$ according to Corollary 1. The expected reward now becomes:
$$R(a,b) = \frac{a}{a+b}\,I(a+1,b) + \frac{b}{a+b}\,I(a,b+1) - I(a,b) = \frac{1}{B(a,b)}\int_{1/2}^{1} t^{a}(1-t)^{b-1}\,dt + \frac{1}{B(a,b)}\int_{1/2}^{1} t^{a-1}(1-t)^{b}\,dt - I(a,b)$$
$$= \frac{1}{B(a,b)}\int_{1/2}^{1}\big(t + (1-t)\big)\,t^{a-1}(1-t)^{b-1}\,dt - I(a,b) = I(a,b) - I(a,b) = 0.$$
Here we use the recurrences $B(a+1,b) = \frac{a}{a+b}B(a,b)$ and $B(a,b+1) = \frac{b}{a+b}B(a,b)$ to show that $\frac{a}{a+b}\,\frac{1}{B(a+1,b)} = \frac{b}{a+b}\,\frac{1}{B(a,b+1)} = \frac{1}{B(a,b)}$. When $a < b$, we can prove $R(a,b) = 0$ in a similar way.

With Lemma 3 in place, the proof of Proposition 3 is straightforward. Recall that the deterministic KG policy chooses the next instance according to $i_t = \arg\max_i R(S^t, i) = \arg\max_i R(a_i^t, b_i^t)$ and breaks ties by selecting the one with the smallest index. Since $R(a,b) > 0$ if and only if $a = b$, at the initial stage $t = 0$ we have $R(a_i^0, b_i^0) > 0$ exactly for the instances $i \in E \equiv \{i : a_i^0 = b_i^0\}$. The policy will first select $i_0 \in E$ with the largest $R(a_{i_0}^0, b_{i_0}^0)$. After obtaining the label $y_{i_0}$, either $a_{i_0}$ or $b_{i_0}$ increases by one, so $a_{i_0}^1 \neq b_{i_0}^1$ and $R(a_{i_0}^1, b_{i_0}^1) = 0$. The policy then selects another instance $i_1 \in E$ with the current largest expected reward, and the expected reward for $i_1$ after obtaining the label $y_{i_1}$ will in turn become zero. As a consequence, the KG policy labels each instance in $E$ once during the first $|E|$ stages, and $R(a_i^{|E|}, b_i^{|E|}) = 0$ for all $i \in \{1,\ldots,K\}$. The deterministic policy then breaks the tie by selecting the first instance to label. From then on, for any $t \geq |E|$: if $a_1^t \neq b_1^t$, then $R(a_1^t, b_1^t) = 0$; since the expected rewards of all other instances are also zero, the policy will still label the first instance. On the other hand, if $a_1^t = b_1^t$, the first instance is the only one with a positive expected reward and the policy will label it. Thus Proposition 3 is proved.

Remark. For randomized KG, after getting one label for each instance in $E$ during the first $|E|$ stages, the expected reward of every instance becomes zero, and randomized KG will uniformly select one instance to label. At any stage $t \geq |E|$, if there exists an instance $i$ with $a_i^t = b_i^t$ (there is at most one such instance), the KG policy will provide the next label for $i$; otherwise, it will randomly select an instance to label.
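As a sanity check on Lemma 3, the sketch below (illustrative only; it assumes SciPy) evaluates the stage-wise reward $R(a,b)$ for small integer pairs and confirms that it vanishes exactly when $a \neq b$ and matches $\frac{1}{a\,2^{2a}\,B(a,a)}$ when $a = b$.

```python
# Small numerical check (not from the paper) of Lemma 3.
from scipy.special import betainc, beta as B

def I(a, b):                      # Pr(theta >= 1/2), theta ~ Beta(a, b)
    return 1.0 - betainc(a, b, 0.5)

def h(x):                         # h(x) = max(x, 1 - x)
    return max(x, 1.0 - x)

def R(a, b):                      # expected one-step reward of one more label
    return a / (a + b) * h(I(a + 1, b)) + b / (a + b) * h(I(a, b + 1)) - h(I(a, b))

for a in range(1, 6):
    for b in range(1, 6):
        r = R(a, b)
        if a == b:
            assert abs(r - 1.0 / (a * 2 ** (2 * a) * B(a, a))) < 1e-12
        else:
            assert abs(r) < 1e-12
```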

4 Proof of Theorem 3

To prove the consistency of the optimistic KG policy, we first derive the exact values of $R^+(a,b) = \max\big(R_1(a,b), R_2(a,b)\big)$, where $R_1(a,b) = h\big(I(a+1,b)\big) - h\big(I(a,b)\big)$ and $R_2(a,b) = h\big(I(a,b+1)\big) - h\big(I(a,b)\big)$ are the rewards of receiving the labels $1$ and $-1$, respectively.

1. When $a \geq b+1$: $R_1(a,b) = I(a+1,b) - I(a,b) = \frac{1}{2^{a+b}\,a\,B(a,b)} > 0$ and $R_2(a,b) = I(a,b+1) - I(a,b) = -\frac{1}{2^{a+b}\,b\,B(a,b)} < 0$. Therefore, $R^+(a,b) = R_1(a,b) = \frac{1}{2^{a+b}\,a\,B(a,b)} > 0$.

2. When $a = b$: $R_1(a,a) = I(a+1,a) - I(a,a) = \frac{1}{a\,2^{2a}\,B(a,a)}$ and $R_2(a,a) = \big(1 - I(a,a+1)\big) - I(a,a) = \frac{1}{a\,2^{2a}\,B(a,a)}$. Therefore, $R_1 = R_2$ and $R^+(a,a) = R_1(a,a) = R_2(a,a) = \frac{1}{a\,2^{2a}\,B(a,a)} > 0$.

3. When $b \geq a+1$: $R_1(a,b) = I(a,b) - I(a+1,b) = -\frac{1}{2^{a+b}\,a\,B(a,b)} < 0$ and $R_2(a,b) = I(a,b) - I(a,b+1) = \frac{1}{2^{a+b}\,b\,B(a,b)} > 0$. Therefore, $R^+(a,b) = R_2(a,b) = \frac{1}{2^{a+b}\,b\,B(a,b)} > 0$.

For better visualization, we plot the values of $R^+(a,b)$ for different $(a,b)$ in Figure 1.

[Figure 1: Illustration of $R^+(a,b)$.]

As we can see, $R^+(a,b) > 0$ for any positive integers $a, b$. We now prove that $\lim_{a+b\to\infty} R^+(a,b) = 0$ via the following lemma.

Lemma 4. Properties of $R^+(a,b)$:
1. $R^+(a,b)$ is symmetric, i.e., $R^+(a,b) = R^+(b,a)$.
2. $\lim_{a\to\infty} R^+(a,a) = 0$.
3. For any fixed $a$, $R^+(a+k, a-k) = R^+(a-k, a+k)$ is monotonically decreasing in $k$ for $k = 0, 1, \ldots, a-1$.
4. When $a \geq b$, for any fixed $b$, $R^+(a,b)$ is monotonically decreasing in $a$. By the symmetry of $R^+(a,b)$, when $b \geq a$, for any fixed $a$, $R^+(a,b)$ is monotonically decreasing in $b$.

By the above four properties, we have $\lim_{a+b\to\infty} R^+(a,b) = 0$.

Proof. We first prove the four properties.

Property 1: By the fact that $B(a,b) = B(b,a)$, the symmetry of $R^+(a,b)$ is straightforward.

Property 2: For $a \geq 1$, $\frac{R^+(a+1,a+1)}{R^+(a,a)} = \frac{2a+1}{2a+2} < 1$ and hence $R^+(a,a)$ is monotonically decreasing in $a$. Moreover,
$$R^+(a,a) = R^+(1,1)\prod_{i=1}^{a-1}\frac{R^+(i+1,i+1)}{R^+(i,i)} = R^+(1,1)\prod_{i=1}^{a-1}\Big(1 - \frac{1}{2i+2}\Big) \leq R^+(1,1)\,e^{-\sum_{i=1}^{a-1}\frac{1}{2i+2}}.$$
Since $\lim_{a\to\infty}\sum_{i=1}^{a-1}\frac{1}{2i+2} = \infty$ and $R^+(a,a) \geq 0$, we conclude that $\lim_{a\to\infty} R^+(a,a) = 0$.

Property 3: For any $k \geq 0$,
$$\frac{R^+(a+k+1, a-k-1)}{R^+(a+k, a-k)} = \frac{(a+k)\,B(a+k, a-k)}{(a+k+1)\,B(a+k+1, a-k-1)} = \frac{a-k-1}{a+k+1} < 1.$$

Property 4: When $a \geq b$, for any fixed $b$:
$$\frac{R^+(a+1,b)}{R^+(a,b)} = \frac{a\,B(a,b)}{2(a+1)\,B(a+1,b)} = \frac{a+b}{2(a+1)} < 1.$$

According to the third property, when $a+b$ is an even number we have $R^+(a,b) \leq R^+\big(\tfrac{a+b}{2}, \tfrac{a+b}{2}\big)$. According to the fourth property, when $a+b$ is an odd number and $a \geq b+1$ we have $R^+(a,b) \leq R^+(a-1,b) \leq R^+\big(\tfrac{a+b-1}{2}, \tfrac{a+b-1}{2}\big)$; while when $a+b$ is an odd number and $a \leq b-1$ we have $R^+(a,b) \leq R^+(a,b-1) \leq R^+\big(\tfrac{a+b-1}{2}, \tfrac{a+b-1}{2}\big)$. Therefore, $R^+(a,b) \leq R^+\big(\lfloor\tfrac{a+b}{2}\rfloor, \lfloor\tfrac{a+b}{2}\rfloor\big)$. According to the second property, $\lim_{a\to\infty} R^+(a,a) = 0$, and we conclude that $\lim_{a+b\to\infty} R^+(a,b) = 0$.
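Lemma 4's properties can likewise be probed numerically. The sketch below (illustrative, assuming SciPy) evaluates $R^+(a,b)$ on a small grid, checks the symmetry of Property 1, and prints the decaying sequence $R^+(a,a)$ from Property 2.

```python
# Illustrative check (not from the paper) of Lemma 4: R+(a, b) is symmetric
# and R+(a, a) decays to zero, so R+(a, b) -> 0 as a + b grows.
from scipy.special import betainc

def I(a, b):
    return 1.0 - betainc(a, b, 0.5)      # Pr(theta >= 1/2), theta ~ Beta(a, b)

def h(x):
    return max(x, 1.0 - x)

def R_plus(a, b):
    R1 = h(I(a + 1, b)) - h(I(a, b))     # reward if the next label is +1
    R2 = h(I(a, b + 1)) - h(I(a, b))     # reward if the next label is -1
    return max(R1, R2)

for a in range(1, 8):
    for b in range(1, 8):
        assert abs(R_plus(a, b) - R_plus(b, a)) < 1e-12          # Property 1
assert all(R_plus(a, a) > R_plus(a + 1, a + 1) for a in range(1, 20))  # Property 2
print([round(R_plus(a, a), 4) for a in (1, 5, 10, 50)])
```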

Using Lemma 4, we first show that, on any sample path, optimistic KG will label each instance infinitely many times as $T$ goes to infinity. Let $\eta_i(T)$ be the random variable representing the number of times that the $i$-th instance has been labeled up to stage $T$ under optimistic KG. Given a sample path $\omega$, let $I(\omega) = \{i : \lim_{T\to\infty}\eta_i(T)(\omega) < \infty\}$ be the set of instances that are labeled only a finite number of times as $T$ goes to infinity on this sample path. We need to prove that $I(\omega)$ is an empty set for any $\omega$, and we prove it by contradiction. Assume that $I(\omega)$ is not empty; then after a certain stage $T_1$, instances in $I(\omega)$ will never be labeled. By Lemma 4, for any $j \in I(\omega)^c$, $\lim_{T\to\infty} R^+\big(a_j^T(\omega), b_j^T(\omega)\big) = 0$. Therefore, there exists $T_2 > T_1$ such that:
$$\max_{j \in I(\omega)^c} R^+\big(a_j^{T_2}(\omega), b_j^{T_2}(\omega)\big) < \max_{i \in I(\omega)} R^+\big(a_i^{T_2}(\omega), b_i^{T_2}(\omega)\big) = \max_{i \in I(\omega)} R^+\big(a_i^{T_1}(\omega), b_i^{T_1}(\omega)\big).$$
Then, according to the optimistic KG policy, the next instance to be labeled must be in $I(\omega)$, which leads to a contradiction. Therefore, $I(\omega)$ is an empty set for every sample path $\omega$.

Let $Y_i^s$ be the random variable that takes the value $1$ if the $s$-th label of the $i$-th instance is $1$ and the value $0$ if the $s$-th label is $-1$. It is easy to see that $\mathbb{E}(Y_i^s \mid \theta_i) = \Pr(Y_i^s = 1 \mid \theta_i) = \theta_i$, and, conditioned on $\theta_i$, the $Y_i^s$, $s = 1, 2, \ldots$, are i.i.d. random variables. By the fact that $\lim_{T\to\infty}\eta_i(T) = \infty$ on all sample paths and using the strong law of large numbers, we conclude that, conditioning on $\{\theta_i\}_{i=1}^K$, the conditional probability of the event
$$\lim_{T\to\infty}\frac{a_i^T - b_i^T}{\eta_i(T)} = \lim_{T\to\infty}\frac{1}{\eta_i(T)}\sum_{s=1}^{\eta_i(T)}\big(2Y_i^s - 1\big) = 2\,\mathbb{E}(Y_i^s\mid\theta_i) - 1 = 2\theta_i - 1 \quad \text{for all } i = 1,\ldots,K$$
is one. According to Proposition 1, we have $H_T = \{i : a_i^T \geq b_i^T\}$ and $H^* = \{i : \theta_i \geq 1/2\}$. The accuracy is
$$\mathrm{Acc}(T) = \frac{1}{K}\Big(|H_T \cap H^*| + |H_T^c \cap (H^*)^c|\Big).$$
We have:
$$\Pr\Big(\lim_{T\to\infty}\mathrm{Acc}(T) = 1 \,\Big|\, \{\theta_i\}_{i=1}^K\Big) = \Pr\Big(\lim_{T\to\infty}\tfrac{1}{K}\big(|H_T\cap H^*| + |H_T^c\cap (H^*)^c|\big) = 1 \,\Big|\, \{\theta_i\}_{i=1}^K\Big) \geq \Pr\Big(\lim_{T\to\infty}\tfrac{a_i^T - b_i^T}{\eta_i(T)} = 2\theta_i - 1,\ i = 1,\ldots,K \,\Big|\, \{\theta_i\}_{i=1}^K\Big) = 1,$$
whenever $\theta_i \neq 1/2$ for all $i$. The inequality is due to the fact that, as long as no $\theta_i$ equals $1/2$, any sample path on which $\lim_{T\to\infty}\frac{a_i^T - b_i^T}{\eta_i(T)} = 2\theta_i - 1$ for $i = 1,\ldots,K$ also satisfies $\lim_{T\to\infty}\mathrm{sgn}\big(a_i^T - b_i^T\big) = \mathrm{sgn}\big(\theta_i - 1/2\big)$, which further implies $\lim_{T\to\infty}\big(|H_T\cap H^*| + |H_T^c\cap (H^*)^c|\big) = K$. Finally, we have:
$$\Pr\Big(\lim_{T\to\infty}\mathrm{Acc}(T) = 1\Big) = \mathbb{E}_{\{\theta_i\}_{i=1}^K}\Big[\Pr\Big(\lim_{T\to\infty}\mathrm{Acc}(T) = 1 \,\Big|\, \{\theta_i\}_{i=1}^K\Big)\Big] = \mathbb{E}_{\{\theta_i:\,\theta_i\neq 1/2\}_{i=1}^K}\Big[\Pr\Big(\lim_{T\to\infty}\mathrm{Acc}(T) = 1 \,\Big|\, \{\theta_i\}_{i=1}^K\Big)\Big] = \mathbb{E}_{\{\theta_i:\,\theta_i\neq 1/2\}_{i=1}^K}\big[1\big] = 1,$$
where the second equality holds because $\{\theta_i : \theta_i = 1/2 \text{ for some } i\}$ is a zero-measure set.
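To make the consistency statement concrete, here is a small simulation sketch (illustrative only; the prior, budget and random seed are arbitrary choices of this illustration, and NumPy/SciPy are assumed): it runs plain optimistic KG on simulated binary labels and reports the accuracy of $H_T$.

```python
# Illustrative simulation (not from the paper): run optimistic KG on K binary
# instances with true soft-labels drawn uniformly, and check the accuracy of
# H_T = {i : a_i >= b_i} after T labels.
import numpy as np
from scipy.special import betainc

rng = np.random.default_rng(0)

def I(a, b):
    return 1.0 - betainc(a, b, 0.5)

def R_plus(a, b):
    h = lambda x: max(x, 1.0 - x)
    return max(h(I(a + 1, b)) - h(I(a, b)), h(I(a, b + 1)) - h(I(a, b)))

K, T = 20, 2000
theta = rng.uniform(size=K)                   # true P(label = 1) per instance
a = np.ones(K); b = np.ones(K)                # Beta(1, 1) priors
for t in range(T):
    i = int(np.argmax([R_plus(a[j], b[j]) for j in range(K)]))  # opt-KG choice
    if rng.random() < theta[i]:
        a[i] += 1
    else:
        b[i] += 1
acc = np.mean((a >= b) == (theta >= 0.5))
print(f"accuracy after {T} labels: {acc:.2f}")
```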

5 Incorporate Workers' Reliability

We assume there are $K$ instances with soft-labels $\theta_i \sim \mathrm{Beta}(a_i^0, b_i^0)$ and $M$ workers with reliabilities $\rho_j \sim \mathrm{Beta}(c_j^0, d_j^0)$. Given the decision to have the $i$-th instance labeled by the $j$-th worker, the probability of the outcome $Z_{ij}$ is:
$$\Pr(Z_{ij} = 1 \mid \theta_i, \rho_j) = \theta_i\rho_j + (1-\theta_i)(1-\rho_j), \qquad \Pr(Z_{ij} = -1 \mid \theta_i, \rho_j) = \theta_i(1-\rho_j) + (1-\theta_i)\rho_j.$$
We approximate the posterior so that at any stage, for all $i, j$, $\theta_i$ and $\rho_j$ follow Beta distributions. In particular, assuming that at the current state $\theta_i \sim \mathrm{Beta}(a_i, b_i)$ and $\rho_j \sim \mathrm{Beta}(c_j, d_j)$, the posterior distribution conditioned on $Z_{ij} = z$ takes the following form:
$$p(\theta_i, \rho_j \mid Z_{ij} = z) = \frac{\Pr(Z_{ij}=z\mid\theta_i,\rho_j)\,\mathrm{Beta}(\theta_i; a_i,b_i)\,\mathrm{Beta}(\rho_j; c_j,d_j)}{\Pr(Z_{ij}=z)}, \qquad z \in \{1, -1\},$$
where the likelihood $\Pr(Z_{ij} = z \mid \theta_i, \rho_j)$ is defined above and
$$\Pr(Z_{ij} = 1) = \mathbb{E}\big(\Pr(Z_{ij}=1\mid\theta_i,\rho_j)\big) = \mathbb{E}(\theta_i)\mathbb{E}(\rho_j) + \big(1-\mathbb{E}(\theta_i)\big)\big(1-\mathbb{E}(\rho_j)\big) = \frac{a_i}{a_i+b_i}\,\frac{c_j}{c_j+d_j} + \frac{b_i}{a_i+b_i}\,\frac{d_j}{c_j+d_j},$$
$$\Pr(Z_{ij} = -1) = \mathbb{E}\big(\Pr(Z_{ij}=-1\mid\theta_i,\rho_j)\big) = \mathbb{E}(\theta_i)\big(1-\mathbb{E}(\rho_j)\big) + \big(1-\mathbb{E}(\theta_i)\big)\mathbb{E}(\rho_j) = \frac{a_i}{a_i+b_i}\,\frac{d_j}{c_j+d_j} + \frac{b_i}{a_i+b_i}\,\frac{c_j}{c_j+d_j}.$$

The posterior distribution $p(\theta_i, \rho_j \mid Z_{ij} = z)$ no longer takes the form of a product of Beta distributions of $\theta_i$ and $\rho_j$. Therefore, we use a variational approximation by first assuming the conditional independence of $\theta_i$ and $\rho_j$:
$$p(\theta_i, \rho_j \mid Z_{ij} = z) \approx p(\theta_i \mid Z_{ij} = z)\, p(\rho_j \mid Z_{ij} = z).$$
In particular, the marginal distributions have the exact form:
$$p(\theta_i \mid Z_{ij} = 1) = \frac{\big(\theta_i\,\mathbb{E}(\rho_j) + (1-\theta_i)(1-\mathbb{E}(\rho_j))\big)\,\mathrm{Beta}(\theta_i; a_i,b_i)}{\Pr(Z_{ij}=1)}, \qquad
p(\theta_i \mid Z_{ij} = -1) = \frac{\big(\theta_i(1-\mathbb{E}(\rho_j)) + (1-\theta_i)\,\mathbb{E}(\rho_j)\big)\,\mathrm{Beta}(\theta_i; a_i,b_i)}{\Pr(Z_{ij}=-1)},$$
$$p(\rho_j \mid Z_{ij} = 1) = \frac{\big(\mathbb{E}(\theta_i)\,\rho_j + (1-\mathbb{E}(\theta_i))(1-\rho_j)\big)\,\mathrm{Beta}(\rho_j; c_j,d_j)}{\Pr(Z_{ij}=1)}, \qquad
p(\rho_j \mid Z_{ij} = -1) = \frac{\big(\mathbb{E}(\theta_i)(1-\rho_j) + (1-\mathbb{E}(\theta_i))\,\rho_j\big)\,\mathrm{Beta}(\rho_j; c_j,d_j)}{\Pr(Z_{ij}=-1)}.$$
To approximate each marginal distribution by a Beta distribution, we use the moment matching technique. In particular, we approximate $\theta_i \mid Z_{ij} = z$ by $\mathrm{Beta}\big(\tilde a_i(z), \tilde b_i(z)\big)$ such that
$$\tilde{\mathbb{E}}_z(\theta_i) \equiv \mathbb{E}_{p(\theta_i\mid Z_{ij}=z)}(\theta_i) = \frac{\tilde a_i(z)}{\tilde a_i(z) + \tilde b_i(z)}, \qquad
\tilde{\mathbb{E}}_z(\theta_i^2) \equiv \mathbb{E}_{p(\theta_i\mid Z_{ij}=z)}(\theta_i^2) = \frac{\tilde a_i(z)\big(\tilde a_i(z)+1\big)}{\big(\tilde a_i(z)+\tilde b_i(z)\big)\big(\tilde a_i(z)+\tilde b_i(z)+1\big)},$$
where the right-hand sides are the first and second moments of $\mathrm{Beta}\big(\tilde a_i(z), \tilde b_i(z)\big)$. Solving these two equations gives:
$$\tilde a_i(z) = \frac{\tilde{\mathbb{E}}_z(\theta_i)\big(\tilde{\mathbb{E}}_z(\theta_i) - \tilde{\mathbb{E}}_z(\theta_i^2)\big)}{\tilde{\mathbb{E}}_z(\theta_i^2) - \big(\tilde{\mathbb{E}}_z(\theta_i)\big)^2}, \qquad
\tilde b_i(z) = \frac{\big(1 - \tilde{\mathbb{E}}_z(\theta_i)\big)\big(\tilde{\mathbb{E}}_z(\theta_i) - \tilde{\mathbb{E}}_z(\theta_i^2)\big)}{\tilde{\mathbb{E}}_z(\theta_i^2) - \big(\tilde{\mathbb{E}}_z(\theta_i)\big)^2}.$$
Similarly, we approximate $\rho_j \mid Z_{ij} = z$ by $\mathrm{Beta}\big(\tilde c_j(z), \tilde d_j(z)\big)$ with
$$\tilde c_j(z) = \frac{\tilde{\mathbb{E}}_z(\rho_j)\big(\tilde{\mathbb{E}}_z(\rho_j) - \tilde{\mathbb{E}}_z(\rho_j^2)\big)}{\tilde{\mathbb{E}}_z(\rho_j^2) - \big(\tilde{\mathbb{E}}_z(\rho_j)\big)^2}, \qquad
\tilde d_j(z) = \frac{\big(1-\tilde{\mathbb{E}}_z(\rho_j)\big)\big(\tilde{\mathbb{E}}_z(\rho_j) - \tilde{\mathbb{E}}_z(\rho_j^2)\big)}{\tilde{\mathbb{E}}_z(\rho_j^2) - \big(\tilde{\mathbb{E}}_z(\rho_j)\big)^2},$$
where $\tilde{\mathbb{E}}_z(\rho_j)$ and $\tilde{\mathbb{E}}_z(\rho_j^2)$ are the first and second moments of $p(\rho_j\mid Z_{ij}=z)$. Furthermore, these moments can be computed exactly by integrating the marginal densities above (e.g., $\tilde{\mathbb{E}}_1(\theta_i) = \big(\mathbb{E}(\theta_i^2)\,\mathbb{E}(\rho_j) + (\mathbb{E}(\theta_i)-\mathbb{E}(\theta_i^2))(1-\mathbb{E}(\rho_j))\big)/\Pr(Z_{ij}=1)$), which gives in closed form:
$$\tilde{\mathbb{E}}_{1}(\theta_i) = \frac{a_i\big((a_i+1)c_j + b_id_j\big)}{(a_i+b_i+1)\big(a_ic_j + b_id_j\big)}, \qquad
\tilde{\mathbb{E}}_{1}(\theta_i^2) = \frac{a_i(a_i+1)\big((a_i+2)c_j + b_id_j\big)}{(a_i+b_i+1)(a_i+b_i+2)\big(a_ic_j + b_id_j\big)},$$
$$\tilde{\mathbb{E}}_{-1}(\theta_i) = \frac{a_i\big(b_ic_j + (a_i+1)d_j\big)}{(a_i+b_i+1)\big(b_ic_j + a_id_j\big)}, \qquad
\tilde{\mathbb{E}}_{-1}(\theta_i^2) = \frac{a_i(a_i+1)\big(b_ic_j + (a_i+2)d_j\big)}{(a_i+b_i+1)(a_i+b_i+2)\big(b_ic_j + a_id_j\big)},$$
$$\tilde{\mathbb{E}}_{1}(\rho_j) = \frac{c_j\big(a_i(c_j+1) + b_id_j\big)}{(c_j+d_j+1)\big(a_ic_j + b_id_j\big)}, \qquad
\tilde{\mathbb{E}}_{1}(\rho_j^2) = \frac{c_j(c_j+1)\big(a_i(c_j+2) + b_id_j\big)}{(c_j+d_j+1)(c_j+d_j+2)\big(a_ic_j + b_id_j\big)},$$
$$\tilde{\mathbb{E}}_{-1}(\rho_j) = \frac{c_j\big(b_i(c_j+1) + a_id_j\big)}{(c_j+d_j+1)\big(b_ic_j + a_id_j\big)}, \qquad
\tilde{\mathbb{E}}_{-1}(\rho_j^2) = \frac{c_j(c_j+1)\big(b_i(c_j+2) + a_id_j\big)}{(c_j+d_j+1)(c_j+d_j+2)\big(b_ic_j + a_id_j\big)}.$$
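The moment-matching update above is mechanical to implement. The sketch below (a minimal illustration, not the paper's code; the function name is mine, and only the $Z_{ij} = +1$ case is written out using the closed-form moments above) maps $(a_i, b_i, c_j, d_j)$ to the approximate posterior parameters.

```python
# Minimal sketch (not the paper's code) of the moment-matching update after
# observing Z_ij = +1, using the closed-form moments given above.
def update_positive(a, b, c, d):
    """Approximate Beta posteriors for (theta_i, rho_j) after Z_ij = +1."""
    # first and second moments of the exact marginals
    E_th  = a * ((a + 1) * c + b * d) / ((a + b + 1) * (a * c + b * d))
    E_th2 = a * (a + 1) * ((a + 2) * c + b * d) / (
        (a + b + 1) * (a + b + 2) * (a * c + b * d))
    E_rh  = c * (a * (c + 1) + b * d) / ((c + d + 1) * (a * c + b * d))
    E_rh2 = c * (c + 1) * (a * (c + 2) + b * d) / (
        (c + d + 1) * (c + d + 2) * (a * c + b * d))

    def beta_from_moments(m1, m2):
        # solve m1 = p/(p+q), m2 = p(p+1)/((p+q)(p+q+1)) for (p, q)
        p = m1 * (m1 - m2) / (m2 - m1 ** 2)
        q = (1.0 - m1) * (m1 - m2) / (m2 - m1 ** 2)
        return p, q

    a_new, b_new = beta_from_moments(E_th, E_th2)
    c_new, d_new = beta_from_moments(E_rh, E_rh2)
    return a_new, b_new, c_new, d_new

print(update_positive(1.0, 1.0, 4.0, 1.0))   # a fairly reliable worker says +1
```

For a uniform instance prior and a worker with mean reliability 0.8, this pushes the mean of $\theta_i$ from 0.5 to 0.6 while leaving the worker's parameters essentially unchanged, which matches the intuition that an uninformative instance cannot teach us about the worker.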

Assume that at a certain stage, $\theta_i$ for the $i$-th instance has the Beta posterior $\mathrm{Beta}(a_i, b_i)$ and $\rho_j$ for the $j$-th worker has the Beta posterior $\mathrm{Beta}(c_j, d_j)$. The rewards of getting the label $1$ and the label $-1$ for the $i$-th instance from the $j$-th worker are:
$$R_1(a_i, b_i, c_j, d_j) = h\big(I(\tilde a_i(1), \tilde b_i(1))\big) - h\big(I(a_i, b_i)\big), \qquad
R_2(a_i, b_i, c_j, d_j) = h\big(I(\tilde a_i(-1), \tilde b_i(-1))\big) - h\big(I(a_i, b_i)\big),$$
where $\tilde a_i(\pm 1)$ and $\tilde b_i(\pm 1)$ are given by the moment-matching equations above and further depend on $c_j$ and $d_j$. With the reward in place, we present the optimistic knowledge gradient algorithm for budget allocation with the modeling of workers' reliability in Algorithm 1.

Algorithm 1: Optimistic Knowledge Gradient with Workers' Reliability
Input: Parameters of the prior distributions for instances $\{a_i^0, b_i^0\}_{i=1}^K$ and for workers $\{c_j^0, d_j^0\}_{j=1}^M$; the total budget $T$.
for $t = 0, 1, \ldots, T-1$ do
  1. Select the next instance $i_t$ to label and the next worker $j_t$ to perform the labeling according to
     $$(i_t, j_t) = \arg\max_{i\in\{1,\ldots,K\},\ j\in\{1,\ldots,M\}} R^+\big(a_i^t, b_i^t, c_j^t, d_j^t\big),$$
     where $R^+(a_i^t, b_i^t, c_j^t, d_j^t) = \max\big(R_1(a_i^t, b_i^t, c_j^t, d_j^t),\ R_2(a_i^t, b_i^t, c_j^t, d_j^t)\big)$.
  2. Acquire the label $Z_{i_tj_t} \in \{1, -1\}$ of the $i_t$-th instance from the $j_t$-th worker.
  3. Update the posterior by setting
     $$a_{i_t}^{t+1} = \tilde a_{i_t}(Z_{i_tj_t}), \quad b_{i_t}^{t+1} = \tilde b_{i_t}(Z_{i_tj_t}), \quad c_{j_t}^{t+1} = \tilde c_{j_t}(Z_{i_tj_t}), \quad d_{j_t}^{t+1} = \tilde d_{j_t}(Z_{i_tj_t}),$$
     leaving all parameters for $i \neq i_t$ and $j \neq j_t$ unchanged.
end for
Output: The positive set $H_T = \{i : a_i^T \geq b_i^T\}$.

6 Extensions

6.1 Incorporating Feature Information

When each instance is associated with a $p$-dimensional feature vector $x_i \in \mathbb{R}^p$, we incorporate the feature information into our budget allocation problem by assuming:
$$\theta_i = \sigma\big(\langle w, x_i\rangle\big) = \frac{1}{1 + \exp\{-\langle w, x_i\rangle\}},$$
where $\sigma(x) = \frac{1}{1+\exp\{-x\}}$ is the sigmoid function and $w$ is assumed to be drawn from a Gaussian prior $N(\mu_0, \Sigma_0)$. At the $t$-th stage with the state $S^t = (\mu_t, \Sigma_t)$ and $w \sim N(\mu_t, \Sigma_t)$, one decides to label the $i_t$-th instance according to a certain policy (e.g., KG) and observes the label $y_{i_t} \in \{1, -1\}$. The posterior distribution $p(w \mid y_{i_t}, S^t) \propto p(y_{i_t} \mid w)\, p(w \mid S^t)$ has the log-likelihood
$$\ln p(w \mid y_{i_t}, S^t) = \ln p(y_{i_t}\mid w) + \ln p(w\mid S^t) + \mathrm{const} = \frac{1+y_{i_t}}{2}\ln\sigma\big(\langle w, x_{i_t}\rangle\big) + \frac{1-y_{i_t}}{2}\ln\big(1-\sigma(\langle w, x_{i_t}\rangle)\big) - \frac{1}{2}(w-\mu_t)^\top \Omega_t (w-\mu_t) + \mathrm{const},$$
where $\Omega_t = \Sigma_t^{-1}$ is the precision matrix. To approximate $p(w\mid y_{i_t}, \mu_t, \Sigma_t)$ by a Gaussian distribution $N(\mu_{t+1}, \Sigma_{t+1})$, we use the Laplace method (see Chapter 4.4 in Bishop, 2007). In particular, we define the mean of the posterior Gaussian using the MAP (maximum a posteriori) estimator of $w$:
$$\mu_{t+1} = \arg\max_w \ln p(w \mid y_{i_t}, S^t),$$
which can be solved by Newton's method, and the precision matrix:
$$\Omega_{t+1} = -\nabla^2 \ln p(w\mid y_{i_t}, S^t)\big|_{w=\mu_{t+1}} = \Omega_t + \sigma\big(\mu_{t+1}^\top x_{i_t}\big)\big(1-\sigma(\mu_{t+1}^\top x_{i_t})\big)\, x_{i_t} x_{i_t}^\top.$$
By the Sherman-Morrison formula, the covariance matrix is
$$\Sigma_{t+1} = \Omega_{t+1}^{-1} = \Sigma_t - \frac{\sigma\big(\mu_{t+1}^\top x_{i_t}\big)\big(1-\sigma(\mu_{t+1}^\top x_{i_t})\big)}{1 + \sigma\big(\mu_{t+1}^\top x_{i_t}\big)\big(1-\sigma(\mu_{t+1}^\top x_{i_t})\big)\, x_{i_t}^\top \Sigma_t x_{i_t}}\; \Sigma_t x_{i_t} x_{i_t}^\top \Sigma_t.$$
We also calculate the transition probability of $y_{i_t} = 1$ (and analogously of $y_{i_t} = -1$) as follows:
$$\Pr(y_{i_t} = 1 \mid S^t, i_t) = \int p(y_{i_t}=1\mid w)\, p(w\mid S^t)\, dw = \int \sigma\big(w^\top x_{i_t}\big)\, p(w\mid S^t)\, dw \approx \sigma\big(\kappa(s_{i_t})\,\mu_{i_t}\big),$$
where $\kappa(s_i) = \big(1 + \pi s_i/8\big)^{-1/2}$, $\mu_i = \langle \mu_t, x_i\rangle$ and $s_i = x_i^\top \Sigma_t x_i$.
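A compact sketch of this Laplace update (illustrative; it assumes NumPy, a fixed number of Newton iterations, and function names of my own choosing) computes the MAP mean, the rank-one covariance update via Sherman-Morrison, and the approximate transition probability.

```python
# Illustrative Laplace update (not the paper's code) for one observed label
# y in {-1, +1} of instance x under the prior w ~ N(mu, Sigma).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_update(mu, Sigma, x, y, newton_steps=20):
    Omega = np.linalg.inv(Sigma)                  # prior precision
    w = mu.copy()
    for _ in range(newton_steps):                 # Newton's method for the MAP
        p = sigmoid(w @ x)
        grad = 0.5 * (1 + y) * (1 - p) * x - 0.5 * (1 - y) * p * x - Omega @ (w - mu)
        hess = -p * (1 - p) * np.outer(x, x) - Omega
        w = w - np.linalg.solve(hess, grad)
    p = sigmoid(w @ x)
    s = p * (1 - p)
    Sx = Sigma @ x
    Sigma_new = Sigma - (s / (1.0 + s * (x @ Sx))) * np.outer(Sx, Sx)  # Sherman-Morrison
    return w, Sigma_new

def prob_label_one(mu, Sigma, x):
    """Approximate Pr(y = 1) = E[sigmoid(w^T x)] via the probit-style kappa trick."""
    m, s2 = mu @ x, x @ Sigma @ x
    return sigmoid(m / np.sqrt(1.0 + np.pi * s2 / 8.0))

mu, Sigma = np.zeros(3), np.eye(3)
x = np.array([1.0, -0.5, 2.0])
mu, Sigma = laplace_update(mu, Sigma, x, y=+1)
print(prob_label_one(mu, Sigma, x))
```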

To calculate the reward function, in addition to the transition probability, we also need to compute:
$$P_i^t = \Pr\big(\theta_i \geq 1/2 \mid F_t\big) = \Pr\Big(\frac{1}{1+\exp\{-w_t^\top x_i\}} \geq \frac12 \,\Big|\, w_t \sim N(\mu_t, \Sigma_t)\Big) = \Pr\big(w_t^\top x_i \geq 0 \mid w_t\sim N(\mu_t,\Sigma_t)\big) = \int_0^\infty\!\!\int_w \delta\big(c - \langle w, x_i\rangle\big)\, N(w; \mu_t, \Sigma_t)\, dw\, dc,$$
where $\delta(\cdot)$ is the Dirac delta function. Let $p(c) = \int_w \delta(c - \langle w, x_i\rangle)\, N(w; \mu_t,\Sigma_t)\, dw$. Since the marginal of a Gaussian distribution is still Gaussian, $p(c)$ is a univariate Gaussian density with mean and variance
$$\mu_i = \mathbb{E}(c) = \mathbb{E}\big(\langle w, x_i\rangle\big) = \langle \mu_t, x_i\rangle, \qquad s_i = \mathrm{Var}(c) = x_i^\top \mathrm{Cov}(w, w)\, x_i = x_i^\top \Sigma_t x_i.$$
Therefore, we have
$$P_i^t = \int_0^\infty p(c)\, dc = \Phi\!\Big(\frac{\mu_i}{\sqrt{s_i}}\Big),$$
where $\Phi(\cdot)$ is the Gaussian CDF. With $P_i^t$ and the transition probability in place, the expected reward in the value function takes the following form:
$$R(S^t, i_t) = \mathbb{E}\Big(\sum_{i=1}^K h(P_i^{t+1}) - \sum_{i=1}^K h(P_i^t) \,\Big|\, S^t, i_t\Big).$$
We note that, since $w$ affects all of the $P_i$, the summation from $1$ to $K$ cannot be omitted, and hence this reward cannot be written as $\mathbb{E}\big(h(P_{i_t}^{t+1}) - h(P_{i_t}^t)\mid S^t, i_t\big)$ as in the basic setting. In this problem, myopic KG or optimistic KG needs to solve $O(TK)$ optimization problems to compute the mean of the posterior as in the MAP step above, which could be computationally quite expensive. One possibility to address this problem is to use variational Bayesian logistic regression (Jaakkola & Jordan, 2000), which could lead to a faster optimization procedure.

6.2 Multi-Class Setting

In the multi-class setting with $C$ different classes, we assume that the $i$-th instance is associated with a probability vector $\theta_i = (\theta_{i1}, \ldots, \theta_{iC})$, where $\theta_{ic}$ is the probability that the $i$-th instance belongs to class $c$ and $\sum_{c=1}^C \theta_{ic} = 1$. We assume that $\theta_i$ has a Dirichlet prior $\theta_i \sim \mathrm{Dir}(\alpha_i^0)$, and our initial state $S^0$ is a $K\times C$ matrix with $\alpha_i^0$ as its $i$-th row. At each stage $t$ with the current state $S^t$, we determine an instance $i_t$ to label and collect its label $y_{i_t} \in \{1, \ldots, C\}$, which follows the categorical distribution $p(y_{i_t}) = \prod_{c=1}^C \theta_{i_tc}^{\,\mathbb{I}(y_{i_t}=c)}$. Since the Dirichlet distribution is the conjugate prior of the categorical distribution, the next state induced by the posterior distribution is:
$$S^{t+1}_{i_t} = S^t_{i_t} + \delta_{y_{i_t}}, \qquad S^{t+1}_i = S^t_i \ \text{ for all } i \neq i_t.$$
Here $\delta_c$ is a row vector with a one at the $c$-th entry and zeros at all other entries. The transition probability is
$$\Pr\big(y_{i_t} = c \mid S^t, i_t\big) = \mathbb{E}\big(\theta_{i_tc}\mid S^t\big) = \frac{\alpha^t_{i_tc}}{\sum_{c'=1}^C \alpha^t_{i_tc'}}.$$
In the multi-class problem, at the final stage $T$ when all the budget has been used up, we construct the set $H_c^T$ for each class $c$ to maximize the conditional expected classification accuracy:
$$\{H_c^T\}_{c=1}^C = \arg\max_{\{H_c\}_{c=1}^C:\ H_c\cap H_{c'} = \emptyset\ \forall c\neq c'} \mathbb{E}\Big(\sum_{i=1}^K\sum_{c=1}^C \mathbb{I}(i\in H_c)\,\mathbb{I}(i \in H_c^*) \,\Big|\, F_T\Big) = \arg\max_{\{H_c\}_{c=1}^C:\ H_c\cap H_{c'}=\emptyset\ \forall c\neq c'} \sum_{i=1}^K\sum_{c=1}^C \mathbb{I}(i\in H_c)\,\Pr\big(i\in H_c^* \mid F_T\big).$$
Here, $H_c^* = \{i : \theta_{ic} \geq \theta_{ic'},\ \forall c'\neq c\}$ is the true set of instances in class $c$, and $H_c^T$ consists of the instances predicted to belong to class $c$; therefore, $\{H_c^T\}_{c=1}^C$ should form a partition of all instances $\{1,\ldots,K\}$. Let
$$P^T_{ic} = \Pr\big(i\in H_c^*\mid F_T\big) = \Pr\big(\theta_{ic}\geq\theta_{ic'},\ \forall c'\neq c \mid F_T\big).$$
To maximize the right-hand side above, we set
$$H_c^T = \big\{i : P^T_{ic} \geq P^T_{ic'},\ \forall c'\neq c\big\}.$$
If some $i$ belongs to more than one $H_c^T$, we assign it only to the one with the smallest index $c$. The maximum conditional expected accuracy takes the form $\sum_{i=1}^K \max_{c\in\{1,\ldots,C\}} P^T_{ic}$. The value function can then be defined as:
$$V(S^0) = \sup_\pi \mathbb{E}\Big(\sum_{i=1}^K\sum_{c=1}^C \mathbb{I}(i\in H_c^T)\,\mathbb{I}(i\in H_c^*) \,\Big|\, S^0\Big) = \sup_\pi \mathbb{E}\Big(\mathbb{E}\Big(\sum_{i=1}^K\sum_{c=1}^C \mathbb{I}(i\in H_c^T)\,\mathbb{I}(i\in H_c^*)\,\Big|\, F_T\Big)\,\Big|\, S^0\Big) = \sup_\pi \mathbb{E}\Big(\sum_{i=1}^K h(P_i^T)\,\Big|\, S^0\Big),$$
where $P_i^T = (P_{i1}^T, \ldots, P_{iC}^T)$ and $h(P_i^T) = \max_{c\in\{1,\ldots,C\}} P_{ic}^T$.
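The probabilities $P^T_{ic}$ and the sets $H^T_c$ above can also be estimated directly by sampling. The sketch below (illustrative only, assuming NumPy; the parameter values are arbitrary) uses Monte Carlo draws from the posterior Dirichlet to estimate $P_{ic}$ and assign each instance to a class.

```python
# Illustrative Monte Carlo estimate (not from the paper) of P_ic = Pr(i in H_c*)
# under the posterior Dirichlet, and the induced decision H_c^T = argmax_c P_ic.
import numpy as np

rng = np.random.default_rng(0)

def estimate_P(alpha, n_samples=5000):
    """alpha: (K, C) posterior Dirichlet parameters. Returns the (K, C) matrix P_ic."""
    K, C = alpha.shape
    P = np.zeros((K, C))
    for i in range(K):
        draws = rng.dirichlet(alpha[i], size=n_samples)    # samples of theta_i
        winners = draws.argmax(axis=1)                      # class with the largest theta_ic
        P[i] = np.bincount(winners, minlength=C) / n_samples
    return P

alpha = np.array([[3.0, 1.0, 1.0],
                  [1.0, 1.0, 4.0],
                  [2.0, 2.0, 1.0]])
P = estimate_P(alpha)
labels = P.argmax(axis=1)        # instance i is assigned to H_{labels[i]}^T
print(P.round(2), labels)
```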

Following Proposition 1, let $P^t_{ic} = \Pr(i\in H_c^*\mid F_t)$ and $P^t_i = (P^t_{i1}, \ldots, P^t_{iC})$. Defining the intermediate reward function at each stage as
$$R(S^t, i_t) = \mathbb{E}\big(h(P^{t+1}_{i_t}) - h(P^t_{i_t}) \mid S^t, i_t\big),$$
the value function can be re-written as
$$V(S^0) = G_0(S^0) + \sup_\pi \mathbb{E}\Big(\sum_{t=0}^{T-1} R(S^t, i_t)\,\Big|\, S^0\Big),$$
where $G_0(S^0) = \sum_{i=1}^K h(P_i^0)$. Since the reward function only depends on $S^t_{i_t} = \alpha^t_{i_t} \in \mathbb{R}^C_+$, we can define the reward function more explicitly as
$$R(\alpha) = \sum_{c=1}^C \frac{\alpha_c}{\sum_{c'=1}^C \alpha_{c'}}\, h\big(I(\alpha + \delta_c)\big) - h\big(I(\alpha)\big).$$
Here $\delta_c$ is a row vector of length $C$ with a one at the $c$-th entry and zeros at all other entries, and $I(\alpha) = (I_1(\alpha), \ldots, I_C(\alpha))$ with
$$I_c(\alpha) = \Pr\big(\theta_c \geq \theta_{c'},\ \forall c'\neq c \mid \theta\sim\mathrm{Dir}(\alpha)\big).$$
Therefore, we have $R(S^t, i_t) = R(\alpha^t_{i_t})$.

To evaluate the reward $R(\alpha)$, the major bottleneck is computing $I_c(\alpha)$ efficiently. Directly performing the $C$-dimensional integration over the region $\{\theta_c\geq\theta_{c'},\ \forall c'\neq c\}\cap\Delta_C$, where $\Delta_C$ denotes the $C$-dimensional simplex, would be computationally very expensive. Therefore, we propose a method to convert the computation of $I_c(\alpha)$ into a one-dimensional integration. It is known that generating $\theta\sim\mathrm{Dir}(\alpha)$ is equivalent to generating $\{X_c\}_{c=1}^C$ with $X_c\sim\mathrm{Gamma}(\alpha_c, 1)$ and setting $\theta_c = \frac{X_c}{\sum_{c'=1}^C X_{c'}}$; then $\theta = (\theta_1,\ldots,\theta_C)$ follows $\mathrm{Dir}(\alpha)$. Therefore, we have
$$I_c(\alpha) = \Pr\big(X_c \geq X_{c'},\ \forall c'\neq c \mid X_{c'}\sim\mathrm{Gamma}(\alpha_{c'},1)\big).$$
It is easy to see that
$$I_c(\alpha) = \int_0^\infty f_{\mathrm{Gamma}}(x_c;\alpha_c,1)\left(\prod_{c'\neq c}\int_0^{x_c} f_{\mathrm{Gamma}}(x_{c'};\alpha_{c'},1)\,dx_{c'}\right)dx_c = \int_0^\infty f_{\mathrm{Gamma}}(x_c;\alpha_c,1)\prod_{c'\neq c} F_{\mathrm{Gamma}}(x_c;\alpha_{c'},1)\,dx_c,$$
where $f_{\mathrm{Gamma}}(x;\alpha_{c},1)$ is the density function of the Gamma distribution with parameters $(\alpha_{c},1)$ and $F_{\mathrm{Gamma}}(x_c;\alpha_{c'},1)$ is the CDF of the Gamma distribution at $x_c$ with parameters $(\alpha_{c'},1)$. In many software packages, $F_{\mathrm{Gamma}}(x_c;\alpha_{c'},1)$ can be evaluated very efficiently without explicit integration, so $I_c(\alpha)$ can be computed by performing only a one-dimensional numerical integration. We could also use a Monte Carlo approximation to further accelerate this computation.
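The one-dimensional reduction above is straightforward to implement with a standard Gamma CDF. The following sketch (illustrative, assuming SciPy's gamma distribution and quadrature routines) evaluates $I_c(\alpha)$ and checks that the entries sum to one. Evaluating the Gamma CDF inside the integrand keeps the cost of each quadrature node linear in $C$, which is what makes the reward cheap to evaluate compared with a $C$-dimensional integration.

```python
# Illustrative implementation (not the paper's code) of the one-dimensional
# integral for I_c(alpha) = Pr(theta_c >= theta_c' for all c'), theta ~ Dir(alpha).
import numpy as np
from scipy.stats import gamma
from scipy.integrate import quad

def I_c(alpha, c):
    alpha = np.asarray(alpha, dtype=float)
    others = np.delete(alpha, c)
    integrand = lambda x: gamma.pdf(x, alpha[c]) * np.prod(gamma.cdf(x, others))
    val, _ = quad(integrand, 0.0, np.inf)
    return val

alpha = [3.0, 1.0, 2.0]
probs = [I_c(alpha, c) for c in range(len(alpha))]
print([round(p, 4) for p in probs], round(sum(probs), 4))   # entries sum to ~1
```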

References

Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2007.

Jaakkola, T. and Jordan, M. I. Bayesian parameter estimation via variational methods. Statistics and Computing, 10:25-37, 2000.

Xie, J. and Frazier, P. I. Sequential Bayes-optimal policies for multiple comparisons with a control. Technical report, Cornell University.