Thompson sampling for web optimisation. 29 Jan 2016 David S. Leslie

1 Thompson sampling for web optimisation 29 Jan 2016 David S. Leslie

3 Plan
Contextual bandits on the web
Thompson sampling in bandits
Selecting multiple adverts
Optimising a web server

4 Contextual bandits...
Receive state signal $x_t$.
Select $a_t$ from a finite set of actions $A$.
Rewards are stationary over time, but depend on both $x_t$ and $a_t$:
$r_t = r(x_t, a_t) + \epsilon_t$

5 ... on the web

6 Natural solution method
$r_t = r(x_t, a_t) + \epsilon_t$
For each $a \in A$, estimate the function $r(\cdot, a)$ of $x$ using some statistical procedure.
When $x_t$ is presented, calculate $\hat{r}_t(x_t, a)$ for each $a$ and select an action.
Objective: maximise average reward, minimise regret, select correct actions eventually.

7 Natural solution method
$r_t = r(x_t, a_t) + \epsilon_t$
For each $a \in A$, estimate the function $r(\cdot, a)$ of $x$ using some statistical procedure.
When $x_t$ is presented, calculate $p(r(x_t, a) \mid H_t)$ for each $a$ and select an action.
Objective: maximise average reward, minimise regret, select correct actions eventually.

9 Simple bandits
Actions: L, R.
Receive state signal $x_t$.
Finite set of actions $a \in A$.
Rewards stationary over time, but depend on $x_t$ and $a_t$.
$r_t = r(a_t) + \epsilon_t$
Estimate $r(L)$ and $r(R)$ using very simple statistics.
On trial $t$, calculate $p(r(a) \mid H_t)$ for each $a$ and select an action.

10 Solution methods
Full Bayesian decision theory (Gittins indices etc.):
Beautiful optimality theory.
Action selected optimises the true objective.
Marginalises over all possible future outcomes.
Impossible to use in all but the simplest settings.
Alternative approach:
Heuristics to balance exploration and exploitation. These often involve randomisation.

11 Undirected action selection
Select based purely on expected values $\hat{r}_t(a)$.
Greedy: action $a_t$ maximises $\hat{r}_t(a)$.
$\epsilon$-greedy: select the greedy action with probability $1 - \epsilon$, otherwise explore a random action.
Softmax: $P(a_t = a \mid H_t) \propto \exp\{\hat{r}_t(a)/\tau\}$
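The three undirected rules can be sketched in a few lines of Python. This is a minimal illustration rather than code from the talk: the function names are made up here, `estimates` stands for the list of values $\hat{r}_t(a)$, and `tau` is the softmax temperature $\tau$.

```python
import math
import random

def greedy(estimates):
    """Index of the action with the largest estimated reward."""
    return max(range(len(estimates)), key=lambda a: estimates[a])

def epsilon_greedy(estimates, epsilon, rng=random):
    """Greedy with probability 1 - epsilon, otherwise a uniformly random action."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return greedy(estimates)

def softmax_probs(estimates, tau):
    """Selection probabilities proportional to exp(estimate / tau)."""
    top = max(estimates)  # subtracting the max avoids overflow in exp
    weights = [math.exp((r - top) / tau) for r in estimates]
    total = sum(weights)
    return [w / total for w in weights]

def softmax(estimates, tau, rng=random):
    """Draw one action from the softmax distribution."""
    probs = softmax_probs(estimates, tau)
    return rng.choices(range(len(estimates)), weights=probs)[0]
```

All three rules take only the point estimates as input, so they cannot tell a well-explored action from a barely explored one with the same estimate.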

12 Spot the difference!
[Figure: two panels, each plotting posterior densities $p(r \mid H)$ against $r$.]
Solid lines are the posterior densities of the expected reward for the red and blue actions. Dashed lines are the means of these distributions. Undirected methods treat the left and right panels identically.

16 Myopic action selection
Give up on full optimality. Use heuristics, usually based on more than just $\hat{r}_t(a)$, to explore sensibly.
Optimism in the face of uncertainty: create confidence intervals for each action, select the action with the highest top of CI.
Thompson sampling: sample a value from the posterior for each action, select the action with the highest sample.
Main idea: the CI and the posterior both narrow as more data are observed for an action, so exploration is more likely for less-visited actions.
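Both heuristics can be illustrated with a small Python sketch. The assumptions here are not from the talk: rewards have Gaussian noise with known standard deviation `noise_sd`, the prior is flat (so the posterior for $r(a)$ is Gaussian with mean equal to the sample mean and standard deviation `noise_sd / sqrt(n)`), every action has been played at least once, and the interval width `z = 1.96` is an arbitrary choice.

```python
import math
import random

def posterior(rewards, noise_sd=1.0):
    """Posterior mean and sd of r(a): flat prior, known Gaussian noise."""
    n = len(rewards)
    return sum(rewards) / n, noise_sd / math.sqrt(n)

def optimistic_choice(history, z=1.96):
    """Select the action whose interval mean + z * sd has the highest top."""
    def upper(action):
        mean, sd = posterior(history[action])
        return mean + z * sd
    return max(history, key=upper)

def thompson_choice(history, rng=random):
    """Sample one value from each posterior and select the highest sample."""
    samples = {}
    for action, rewards in history.items():
        mean, sd = posterior(rewards)
        samples[action] = rng.gauss(mean, sd)
    return max(samples, key=samples.get)

# L has been played 100 times, R only once, so R's posterior is much wider.
history = {"L": [0.4, 0.6] * 50, "R": [0.3]}
```

With this `history`, `optimistic_choice` selects R because its interval is wide, and `thompson_choice` selects R a substantial fraction of the time even though its estimate is lower.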

17 Thompson sampling properties
Posteriors over action values → Thompson sampling → probabilistic action selection
$P(a_t = a \mid H_t) = P(r(a) \text{ is maximal} \mid H_t)$
Proof idea: let $Q_t(a) \sim p(r(a) \mid H_t)$. Then $\{a_t = a\} = \{Q_t(a) > Q_t(b) \ \forall b \neq a\}$.
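The proof idea can be written out in one chain of equalities. This is a sketch under two assumptions not stated on the slide: the sampled values $Q_t(\cdot)$ are drawn jointly from the posterior (or the posteriors are independent across actions), and ties have probability zero. The two-action Gaussian case, with illustrative posterior means $\mu_a, \mu_b$ and standard deviations $\sigma_a, \sigma_b$, is added as a worked instance.

```latex
\begin{align*}
P(a_t = a \mid H_t)
  &= P\bigl(Q_t(a) > Q_t(b)\ \forall b \neq a \mid H_t\bigr) \\
  &= P\bigl(r(a) > r(b)\ \forall b \neq a \mid H_t\bigr)
   = P\bigl(r(a) \text{ is maximal} \mid H_t\bigr).
\intertext{For two actions with independent posteriors
$r(a) \mid H_t \sim N(\mu_a, \sigma_a^2)$ and $r(b) \mid H_t \sim N(\mu_b, \sigma_b^2)$:}
P(a_t = a \mid H_t)
  &= \Phi\!\left(\frac{\mu_a - \mu_b}{\sqrt{\sigma_a^2 + \sigma_b^2}}\right).
\end{align*}
```

The second equality holds because, given $H_t$, the vector of sampled values has the same distribution as the vector of unknown rewards.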

18 Thompson sampling properties
Posteriors over action values → Thompson sampling → probabilistic action selection
Suboptimal actions with high uncertainty are selected with larger probability than those with low uncertainty.
[Figure: two panels, each plotting posterior densities $p(r \mid H)$ against $r$.]

19 Thompson sampling properties
Posteriors over action values → Thompson sampling → probabilistic action selection
Fixed posteriors for unplayed actions ⇒ infinite exploration
Proof idea:
Suppose L is only played finitely often ⇒ the posterior for $r(L)$ freezes.
R is played infinitely often, and the posterior for $r(R)$ converges, so sampled values for R converge to $r(R)$.
So the probability of playing L is bounded below.
So $\sum_t P(a_t = L \mid H_t) = \infty$ (Borel-Cantelli).

20 Thompson sampling properties
Posteriors over action values → Thompson sampling → probabilistic action selection
Asymptotic average reward is $\max_a r(a)$
Proof idea:
Infinite exploration ⇒ posteriors converge to $r(a)$.
For all large $t$, sampled values for $a$ are close to $r(a)$ with high probability.
$\forall \epsilon > 0$, the probability of selecting the best action is larger than $1 - \epsilon$ for large $t$.
Coupling argument ⇒ average reward converges to $\max_a r(a)$.

21 Theory
May, Korda, Lee and DL, JMLR 2012
Theorem: In bandit problems with stationary reward functions $r(a)$, if Thompson sampling is used then
$\lim_{T \to \infty} \dfrac{\sum_{t=1}^{T} r(a_t)}{\sum_{t=1}^{T} \max_a r(a)} = 1$
(In English: the average reward is as good as it could be.)
Cleverer theory: finite-time regret properties, in more restricted settings (see Korda, Agrawal and others).

22 Theory
May, Korda, Lee and DL, JMLR 2012
Theorem: In contextual bandit problems with stationary reward functions $r(x, a)$, if Thompson sampling is used then
$\lim_{T \to \infty} \dfrac{\sum_{t=1}^{T} r(x_t, a_t)}{\sum_{t=1}^{T} \max_a r(x_t, a)} = 1$
(In English: the average reward is as good as it could be.)

24 A problem
Let $Q_t(a) \sim p(r(a) \mid H_t)$ be the sampled value for action $a$.
Decompose as $Q_t(a) = \hat{r}_t(a) + \text{exploratory bonus}$.
Thompson sampling gives negative exploratory bonuses????
[Figure: two panels, each plotting posterior densities $p(r \mid H)$ against $r$.]
Reduced probability of selecting high-variance optimal actions.

25 Optimistic Bayesian Sampling
May, Korda, Lee and DL, JMLR 2012
Let $Q_t(a) \sim p(r(a) \mid H_t)$ be the sampled value for action $a$.
Set $Q_t^{\mathrm{OBS}}(a) = \max\{Q_t(a), \hat{r}_t(a)\}$.
Select the action to maximise $Q_t^{\mathrm{OBS}}$.
All proofs go through as before.
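A short Python sketch shows the difference between plain Thompson sampling and the optimistic variant. It assumes, for illustration only, that each posterior is Gaussian and is supplied as a `(mean, sd)` pair; the action names and numbers are made up.

```python
import random

def thompson_choice(posteriors, rng=random):
    """Plain Thompson sampling: pick the action with the highest sample Q_t(a)."""
    samples = {a: rng.gauss(mean, sd) for a, (mean, sd) in posteriors.items()}
    return max(samples, key=samples.get)

def obs_choice(posteriors, rng=random):
    """Optimistic Bayesian sampling: the sample is floored at the posterior mean."""
    scores = {}
    for action, (mean, sd) in posteriors.items():
        sample = rng.gauss(mean, sd)        # Thompson sample Q_t(a)
        scores[action] = max(sample, mean)  # Q_t^OBS(a) = max{Q_t(a), r_hat_t(a)}
    return max(scores, key=scores.get)

# "red" has the higher mean but a much wider posterior than "blue".
posteriors = {"blue": (0.5, 0.0001), "red": (0.6, 1.0)}
```

With these posteriors, plain Thompson sampling picks the better action "red" only a little more than half the time, because red's sample often falls below 0.5. The optimistic version never scores red below its mean of 0.6, so it selects red essentially every time.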

26 Emergent software
with Barry Porter and Matthew Grieves
[Figure: component diagram of the web server.]
App <interface>: WebServer. Main method: opens a server socket and accepts client connections, each of which is passed to a request handler.
RequestHandler <interface>: takes a client socket, applies a concurrency approach, and passes the socket on to the HTTP handler. Implementations: RequestHandler and RequestHandlerPT (a thread-per-client implementation and a thread-pool implementation).
HTTPHandler <interface>: takes a client socket, parses HTTP request headers and formulates a response. Implementations: HTTPHandler, HTTPHandlerCMP, HTTPHandlerCH and HTTPHandlerCHCMP. The implementations cover: without caching or compression, with compression, with caching, with caching and compression.
Compressor <interface>: GZip, Deflate.
Cache <interface>: Cache, CacheFS, CacheLFU, CacheMRU, CacheLRU, CacheRR.

27 Emergent software
with Barry Porter and Matthew Grieves
Each component of the server can be provided by several implementations: 42 different valid configurations.
Configurations perform well under different traffic scenarios.
Learn to use the best configuration.
Framework: every 10 seconds, try a configuration and observe performance.
Uh oh: trying each configuration only once takes 7 minutes...

28 Regression model
Similar approach to Scott (2010).
Each component corresponds to a factor variable:
ResponseTime ~ RequestHandler + HTTPHandler + Compressor + Cache
A configuration conf corresponds to a binary vector $x_{\mathrm{conf}}$.
Expected response time for deploying conf is given by $x_{\mathrm{conf}} \beta$, where $\beta$ is unknown.
Only 11 regression coefficients.

29 Iterative decision-making
In each 10 second slot:
Choose an action based on the fitted model.
Observe the outcome.
Add the observation to the pool of data.
Update the statistical model.
Challenge: need to manage explore/exploit, as in simple bandits.

30 Thompson sampling
Thompson sampling implementation: use Bayesian linear regression. Then for each $t$:
sample a $\beta^{\mathrm{Th}}$ from the posterior at time $t$;
deploy the conf which maximises $x_{\mathrm{conf}} \beta^{\mathrm{Th}}$.
That's it!
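The sample-then-deploy loop can be sketched in plain Python. Everything specific below is an assumption made for illustration, not the talk's implementation: the factors and levels in `FACTORS` are hypothetical (they reproduce neither the 42 configurations nor the 11 coefficients), the prior is $N(0, \texttt{prior\_sd}^2 I)$, the noise standard deviation `noise_sd` is treated as known, `true_beta` is a made-up ground truth, and the modelled response is taken to be a reward to maximise (for example a reciprocal response time).

```python
import itertools
import math
import random

# Hypothetical factors and levels, for illustration only.
FACTORS = {
    "RequestHandler": ["PerClient", "Pool"],
    "Cache": ["LRU", "LFU", "MRU"],
}
CONFS = [dict(zip(FACTORS, combo)) for combo in itertools.product(*FACTORS.values())]

def encode(conf):
    """Binary vector x_conf: intercept plus one dummy per non-baseline level."""
    x = [1.0]
    for name, levels in FACTORS.items():
        x += [1.0 if conf[name] == level else 0.0 for level in levels[1:]]
    return x

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def solve_lower(low, b):
    """Solve L u = b by forward substitution."""
    u = []
    for i, row in enumerate(low):
        u.append((b[i] - sum(row[k] * u[k] for k in range(i))) / row[i])
    return u

def solve_upper(low, u):
    """Solve L^T v = u by back substitution."""
    n = len(u)
    v = [0.0] * n
    for i in reversed(range(n)):
        v[i] = (u[i] - sum(low[k][i] * v[k] for k in range(i + 1, n))) / low[i][i]
    return v

def sample_beta(xs, ys, dim, noise_sd, prior_sd, rng):
    """Draw beta from the Gaussian posterior of a Bayesian linear regression."""
    prec = [[(prior_sd ** -2 if i == j else 0.0) for j in range(dim)] for i in range(dim)]
    b = [0.0] * dim
    for x, y in zip(xs, ys):
        for i in range(dim):
            b[i] += x[i] * y / noise_sd ** 2
            for j in range(dim):
                prec[i][j] += x[i] * x[j] / noise_sd ** 2
    low = cholesky(prec)
    mean = solve_upper(low, solve_lower(low, b))
    # Posterior covariance is prec^-1 = L^-T L^-1, so L^-T z has that covariance.
    jitter = solve_upper(low, [rng.gauss(0.0, 1.0) for _ in range(dim)])
    return [m + e for m, e in zip(mean, jitter)]

def run(true_beta, steps, noise_sd=0.5, prior_sd=10.0, seed=0):
    """Simulated deployment loop: sample beta, deploy the argmax, observe, update."""
    rng = random.Random(seed)
    xs, ys, picks = [], [], []
    for _ in range(steps):
        beta_th = sample_beta(xs, ys, len(true_beta), noise_sd, prior_sd, rng)
        conf = max(CONFS, key=lambda c: dot(encode(c), beta_th))
        x = encode(conf)
        xs.append(x)
        ys.append(dot(x, true_beta) + rng.gauss(0.0, noise_sd))
        picks.append(conf)
    return picks
```

Calling `run([1.0, 1.0, 2.0, 0.0], 200)` simulates 200 slots in which the Pool handler and the LFU cache are the best choices. Because the model is additive, an observation of one configuration also informs the estimates for every other configuration that shares a component with it, which is why far fewer trials are needed than one per configuration. The posterior is refitted from scratch at every step to keep the sketch short.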

31 Initial results
Repeatedly requesting a small text file.
Loss is the difference between the reciprocal of the optimal response time at that instant and the reciprocal of the actual response time.

32 Changing request patterns
Low/High text and Low/High entropy.
Different configurations are better for different request patterns.

33 Changing request patterns
Alternating traffic characteristics.
The request pattern alternates, switching every 10 iterations. Poor performance.

34 Using context
Coding the context
At the end of iteration $t$, categorise the traffic as HighEnt/LowEnt and as HighText/LowText.
Include Ent and Text as factors in the regression.
Also include the interactions Ent:Cache and Text:Compressor.
Performance under different traffic characteristics is learned.

35 Using context
Decision-making
Thompson sampling implementation: use Bayesian linear regression. Then for each $t$:
sample a $\beta^{\mathrm{Th}}$ from the posterior at time $t$;
deploy the conf which maximises $\big((\mathrm{Ent}_{t-1}, \mathrm{Text}_{t-1})\; x_{\mathrm{conf}}\big)\,\beta^{\mathrm{Th}}$.
This makes the working assumption that $(\mathrm{Ent}_t, \mathrm{Text}_t) = (\mathrm{Ent}_{t-1}, \mathrm{Text}_{t-1})$.
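One way to build the regression row from the previous iteration's context and a candidate configuration is sketched below. The level names in `CACHES` and `COMPRESSORS` are hypothetical, only the Cache and Compressor factors are included, and the exact coding used in the talk is not shown on the slides, so this is an illustration of main effects plus the Ent:Cache and Text:Compressor interactions rather than the authors' design matrix.

```python
CACHES = ["NoCache", "LRU", "LFU"]        # hypothetical levels
COMPRESSORS = ["NoCompression", "GZip"]   # hypothetical levels

def dummies(value, levels):
    """Treatment coding: one 0/1 entry per non-baseline level."""
    return [1.0 if value == level else 0.0 for level in levels[1:]]

def features(high_ent, high_text, cache, compressor):
    """Intercept, main effects, then Ent:Cache and Text:Compressor interactions."""
    ent = 1.0 if high_ent else 0.0
    text = 1.0 if high_text else 0.0
    cache_x = dummies(cache, CACHES)
    comp_x = dummies(compressor, COMPRESSORS)
    return ([1.0] + cache_x + comp_x + [ent, text]
            + [ent * c for c in cache_x]    # Ent:Cache
            + [text * c for c in comp_x])   # Text:Compressor

def best_conf(prev_ent, prev_text, beta):
    """Configuration maximising the linear predictor under last iteration's context."""
    confs = [(cache, comp) for cache in CACHES for comp in COMPRESSORS]
    def score(conf):
        row = features(prev_ent, prev_text, *conf)
        return sum(f * b for f, b in zip(row, beta))
    return max(confs, key=score)
```

In a Thompson sampling loop, `beta` would be a draw from the regression posterior. The interaction columns are what allow the preferred compressor or cache to change with the traffic type.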

36 Using context
Results
The request pattern alternates, switching every 10 iterations. Good performance.

37 Conclusion
Contextual bandits and Thompson sampling: simple and (provably and empirically) effective.
Optimistic Bayesian sampling: removes the negative exploratory bonus.
Extremely simple to deploy in more complicated settings.
Basic statistical approaches are a revelation to (some) Data Scientists.

38 29 Jan 2016 David S. Leslie

39 Backup slides

40 Copify
With G Malhotra, W Simm and R McVey
Marketplace matching copywriting jobs with authors.
Copywriters select from the (ever-changing) available jobs.

41 A Copify brief

44 The writer's view

45 Copify's challenge
The brief: offer appropriate jobs to a writer when they log in.
Main differentiating features:
Jobs: a relatively small amount of free text.
Writers: history of jobs accepted/declined.
Challenges include:
only light computation is allowed;
zero to moderate data per writer;
each job is completed by only one writer;
a different set of available jobs on each login.

46 Encoding a brief
Whenever a job arrives, it is coded into a regression vector $x$, consisting of:
price;
reported topic category;
(SVD-compressed) bag of semantic topic counts.

47 Learning writer preferences
For each writer $w$, we know:
which briefs they have been shown;
which briefs they have accepted.
Simple logistic regression to estimate writer preferences $\hat{\beta}_w$ and covariance $\Sigma_w = \mathrm{var}(\hat{\beta}_w)$.
Updated each night for each writer.
If there is insufficient data (< 20 previous jobs), set $\hat{\beta}_w$ and $\Sigma_w$ to a globally-estimated version with inflated covariance.

48 Displaying jobs
On page load, there are jobs $j = 1, \ldots, J$ waiting to be accepted.
Thompson sampling principle: the system selects job $j$ with the probability that job $j$ is the best.
Implementation in regression framework: sample $\beta_w^{\mathrm{TS}} \sim N(\hat{\beta}_w, \Sigma_w)$, select $\arg\max_j x_j \beta_w^{\mathrm{TS}}$.
Optimistic version: replace $x_j \beta_w^{\mathrm{TS}}$ with $\max\{x_j \beta_w^{\mathrm{TS}}, x_j \hat{\beta}_w\}$.

49 Displaying jobs
On page load, there are jobs $j = 1, \ldots, J$ waiting to be accepted.
Thompson sampling principle: the system selects job $j$ with the probability that job $j$ is the best.
Implementation in regression framework: sample $\beta_w^{\mathrm{TS}} \sim N(\hat{\beta}_w, \Sigma_w)$, rank jobs according to $x_j \beta_w^{\mathrm{TS}}$.
Optimistic version: replace $x_j \beta_w^{\mathrm{TS}}$ with $\max\{x_j \beta_w^{\mathrm{TS}}, x_j \hat{\beta}_w\}$.
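The ranking step can be sketched in plain Python as follows. The function names, job identifiers and feature vectors are made up for illustration; `beta_hat` and `sigma` stand for $\hat{\beta}_w$ and $\Sigma_w$, and `sigma` is assumed to be symmetric positive definite so that its Cholesky factor exists.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def sample_beta(beta_hat, sigma, rng=random):
    """Draw beta_TS ~ N(beta_hat, sigma) as beta_hat + L z, where L L^T = sigma."""
    low = cholesky(sigma)
    z = [rng.gauss(0.0, 1.0) for _ in beta_hat]
    return [b + dot(row, z) for b, row in zip(beta_hat, low)]

def rank_jobs(jobs, beta_hat, sigma, optimistic=False, rng=random):
    """jobs maps a job id to its feature vector x_j; returns job ids, best first."""
    beta_ts = sample_beta(beta_hat, sigma, rng)
    def score(job_id):
        sampled = dot(jobs[job_id], beta_ts)
        if optimistic:
            # Optimistic version: never score a job below its point estimate.
            return max(sampled, dot(jobs[job_id], beta_hat))
        return sampled
    return sorted(jobs, key=score, reverse=True)
```

A single draw of $\beta_w^{\mathrm{TS}}$ is shared by all jobs on one page load, so the whole ranking is consistent with one sampled set of preferences. A fresh draw on the next page load produces a different ordering when $\Sigma_w$ is large.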

50 Effectiveness
The new brief is ranked highly. It is for a blog post about fantasy football.
This writer has completed many tasks to do with football. The editorial team also know the writer to be football mad.

51 Effectiveness Hopefully some performance stats
