Online Learning and Sequential Decision Making


1 Online Learning and Sequential Decision Making Emilie Kaufmann, CNRS & CRIStAL, Inria SequeL. Research School, ENS Lyon, November 12-13th, 2018

2 Learning: make predictions, and (in this class) make decisions, based on examples / history; use observed data to adapt one's behavior on new data.

3 Part II: Sequential Decision Making Sequential Resource Allocation Stochastic Multi-Armed Bandit Models Bandit Algorithms for Maximizing Rewards First Ideas UCB algorithms Bayesian Bandit Algorithms Beyond Simple Bandit Models Bandit algorithms for Optimal Exploration

4 Part II: Sequential Decision Making Sequential Resource Allocation Stochastic Multi-Armed Bandit Models Bandit Algorithms for Maximizing Rewards First Ideas UCB algorithms Bayesian Bandit Algorithms Beyond Simple Bandit Models Bandit algorithms for Optimal Exploration

5 Examples Clinical trials: K treatments for a given symptom (with unknown effects). What treatment should be allocated to the next patient, based on responses observed on previous patients?

6 Examples Clinical trials: K treatments for a given symptom (with unknown effects). What treatment should be allocated to the next patient, based on responses observed on previous patients? Online advertisement: K ads that can be displayed. Which ad should be displayed for a user, based on the previous clicks of previous (similar) users?

7 Dynamic allocation of computational resources Numerical experiments: where to evaluate a costly function in order to find its maximum quickly?

8 Dynamic allocation of computational resources Numerical experiments: where to evaluate a costly function in order to find its maximum quickly? Artificial intelligence for games: where to choose the next evaluation to perform in order to find the best move to play next?

9 Multi-Armed Bandit One agent, K possible actions, which provide rewards. Goal: maximize the sum of the rewards obtained.

10 Multi-Armed Bandit At each time step $t = 1,\dots,T$, the agent chooses an action $A_t \in \{1,\dots,K\}$ and observes a reward $R_t$ associated to the chosen action. Sequential strategy: $A_t$ depends on $A_1, R_1, \dots, A_{t-1}, R_{t-1}$.

11 How to model the rewards? As for predicting individual sequences, we can first consider arbitrary rewards. Action k: (arbitrary) sequence of rewards $r_{k,1}, r_{k,2}, \dots, r_{k,T}$. Reward at time t: reward of the chosen action, $R_t = r_{A_t,t}$. We don't observe $r_{k,t}$ for $k \neq A_t$! Goal: maximize the sum of the rewards, i.e. minimize the regret $$R_T = \underbrace{\max_{k \in \{1,\dots,K\}} \sum_{t=1}^{T} r_{k,t}}_{\text{cumulative reward of the best action}} - \underbrace{\sum_{t=1}^{T} R_t}_{\text{cumulative reward of our strategy}}$$

12 EXP3 for rewards maximization Hypothesis: $r_{k,t} \in [0,1]$ for all $t$. Parameter: $\eta > 0$. Initialization: for all $k \in \{1,\dots,K\}$, $w_{k,1} = \frac{1}{K}$. For $t = 1,\dots,T$: 1. Compute the probability vector $p_t = (p_{1,t},\dots,p_{K,t})$ where $p_{k,t} = \frac{w_{k,t}}{\sum_{i=1}^{K} w_{i,t}}$ 2. Choose an action $A_t \sim p_t$, i.e. $P(A_t = k) = p_{k,t}$ 3. Observe the reward $R_t = r_{A_t,t}$ 4. Compute the (re-weighted) loss of the chosen action: $\tilde{L}_t = \frac{1 - R_t}{p_{A_t,t}}$ 5. Update the weight of the chosen action: $w_{A_t,t+1} = w_{A_t,t} \exp(-\eta \tilde{L}_t)$
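
To make the update rule concrete, here is a minimal Python sketch of EXP3 on synthetic rewards. The function name exp3, the pre-generated reward_matrix, and the Bernoulli reward process are illustrative assumptions, not part of the slides.

```python
import numpy as np

def exp3(reward_matrix, eta, rng=np.random.default_rng(0)):
    """Minimal EXP3 sketch for rewards in [0, 1].

    reward_matrix[t, k] is the reward r_{k,t} of action k at round t;
    only the reward of the chosen action is revealed to the learner.
    """
    T, K = reward_matrix.shape
    w = np.ones(K) / K                      # w_{k,1} = 1/K
    total_reward = 0.0
    for t in range(T):
        p = w / w.sum()                     # p_{k,t} proportional to w_{k,t}
        A = rng.choice(K, p=p)              # A_t ~ p_t
        R = reward_matrix[t, A]             # observe R_t = r_{A_t,t}
        total_reward += R
        loss = (1.0 - R) / p[A]             # re-weighted loss of the chosen action
        w[A] *= np.exp(-eta * loss)         # multiplicative weight update
    return total_reward

# Example on synthetic Bernoulli rewards (two actions, horizon T).
T, K = 10_000, 2
rng = np.random.default_rng(42)
rewards = rng.binomial(1, [0.5, 0.6], size=(T, K)).astype(float)
eta = np.sqrt(np.log(K) / (K * T))          # tuning suggested by the theorem below
print(exp3(rewards, eta))
```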

13 Analysis of EXP3 Theorem [Auer et al., 02] For the choice $\eta_T = \sqrt{\frac{\log(K)}{KT}}$, EXP3($\eta_T$) satisfies $\mathbb{E}[R_T] \le 2\sqrt{\ln(K)\,KT}$. Leveraging some stochastic assumption on the rewards, one can obtain better regret guarantees.

14 Outline Sequential Resource Allocation Stochastic Multi-Armed Bandit Models Bandit Algorithms for Maximizing Rewards First Ideas UCB algorithms Bayesian Bandit Algorithms Beyond Simple Bandit Models Bandit algorithms for Optimal Exploration

15 Stochastic Multi-Armed Bandit model A simple stochastic model: for $k = 1,\dots,K$, $(X_{k,t})_{t \in \mathbb{N}}$ is i.i.d. with distribution $\nu_k$. K arms ↔ K (unknown) probability distributions $\nu_1, \nu_2, \nu_3, \nu_4, \nu_5$. At round t, an agent: chooses an arm $A_t$, receives a reward $R_t = X_{A_t,t} \sim \nu_{A_t}$. The sampling strategy (or bandit algorithm) $(A_t)$ is sequential: $A_{t+1} = F_t(A_1, R_1, \dots, A_t, R_t)$. Goal: Maximize $\mathbb{E}\left[\sum_{t=1}^{T} R_t\right]$.

16 Stochastic Multi-Armed Bandit model A simple stochastic model: for $k = 1,\dots,K$, $(X_{k,t})_{t \in \mathbb{N}}$ is i.i.d. with mean $\mu_k$. K arms ↔ K (unknown) probability distributions, with means $\mu_1, \mu_2, \mu_3, \mu_4, \mu_5$. At round t, an agent: chooses an arm $A_t$, receives a reward $R_t = X_{A_t,t} \sim \nu_{A_t}$. The sampling strategy (or bandit algorithm) $(A_t)$ is sequential: $A_{t+1} = F_t(A_1, R_1, \dots, A_t, R_t)$. Goal: Maximize $\mathbb{E}\left[\sum_{t=1}^{T} R_t\right]$.

17 Why Stochastic Bandits? $\nu_1\ \nu_2\ \nu_3\ \nu_4\ \nu_5$ Goal: maximize one's gains in a casino? (HOPELESS)

18 Clinical trials Historical motivation [Thompson 1933] $\mathcal{B}(\mu_1)\ \mathcal{B}(\mu_2)\ \mathcal{B}(\mu_3)\ \mathcal{B}(\mu_4)\ \mathcal{B}(\mu_5)$ For the t-th patient in a clinical study, choose a treatment $A_t$, observe a response $R_t \in \{0,1\}$: $P(R_t = 1 \mid A_t = a) = \mu_a$. Goal: maximize the number of patients healed.

19 Online content optimization Modern motivation (recommender systems, online advertisement, A/B testing...) $\nu_1\ \nu_2\ \nu_3\ \nu_4\ \nu_5$ For the t-th visitor of a website, recommend a movie $A_t$, observe a rating $R_t \sim \nu_{A_t}$ (e.g. $R_t \in \{1,\dots,5\}$). Goal: maximize the sum of ratings.

20 Cognitive radios Agent: a smart radio device. Arms: radio channels (frequency bands), with streams indicating channel availabilities: Channel 1: $X_{1,1}, X_{1,2}, \dots, X_{1,t}, \dots, X_{1,T}$; Channel 2: $X_{2,1}, X_{2,2}, \dots, X_{2,t}, \dots, X_{2,T}$; ...; Channel K: $X_{K,1}, X_{K,2}, \dots, X_{K,t}, \dots, X_{K,T}$. At round t, the device: selects channel $A_t$, observes the channel availability $R_t = X_{A_t,t} \in \{0,1\}$. Goal: Maximize the number of successful transmissions.

21 Cognitive radios Agent: a smart radio device. Arms: radio channels (frequency bands), with streams indicating channel availabilities: Arm 1: $X_{1,1}, X_{1,2}, \dots, X_{1,t}, \dots, X_{1,T}$; Arm 2: $X_{2,1}, X_{2,2}, \dots, X_{2,t}, \dots, X_{2,T}$; ...; Arm K: $X_{K,1}, X_{K,2}, \dots, X_{K,t}, \dots, X_{K,T}$. At round t, the device: selects arm $A_t$, observes the channel availability $R_t = X_{A_t,t} \in \{0,1\}$. Goal: Maximize the number of successful transmissions.

22 Objective Goal: find a strategy maximizing $\mathbb{E}\left[\sum_{t=1}^{T} R_t\right]$. Oracle: always play the arm $k^* = \operatorname{argmax}_{k \in \{1,\dots,K\}} \mu_k$, with mean $\mu^* = \max_{k \in \{1,\dots,K\}} \mu_k$. Can we be almost as good as the oracle? $\mathbb{E}\left[\sum_{t=1}^{T} R_t\right] \simeq T\mu^*$?

23 Performance measure: the regret Maximizing rewards ⇔ minimizing the regret $$R_T := T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} R_t\right] = \sum_{k=1}^{K} (\mu^* - \mu_k)\,\mathbb{E}[N_k(T)],$$ where $N_k(t)$ is the number of draws of arm k up to round t. Need for an Exploration/Exploitation tradeoff.

24 Performance measure: the regret Maximizing rewards ⇔ minimizing the regret $R_T := T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} R_t\right]$. We want the regret to grow sub-linearly: $\frac{R_T}{T} \to 0$ as $T \to \infty$ (consistency). What rate of regret can we expect?

25 A lower bound on the regret Bernoulli bandit model, $\mu = (\mu_1,\dots,\mu_K)$: $R_T(\mu) = \sum_{k=1}^{K} (\mu^* - \mu_k)\,\mathbb{E}_\mu[N_k(T)]$. When T grows, all the arms should be drawn infinitely often! [Lai & Robbins, 1985]: for any uniformly good strategy, $$\mu_k < \mu^* \;\Rightarrow\; \liminf_{T \to \infty} \frac{\mathbb{E}_\mu[N_k(T)]}{\log T} \ge \frac{1}{d(\mu_k, \mu^*)},$$ where $d(p, p') = \mathrm{KL}(\mathcal{B}(p), \mathcal{B}(p')) = p\log\frac{p}{p'} + (1-p)\log\frac{1-p}{1-p'}$. The regret is at least logarithmic.

26 A lower bound on the regret Bernoulli bandit model, $\mu = (\mu_1,\dots,\mu_K)$: $R_T(\mu) = \sum_{k=1}^{K} (\mu^* - \mu_k)\,\mathbb{E}_\mu[N_k(T)]$. When T grows, all the arms should be drawn infinitely often! [Lai & Robbins, 1985]: for any uniformly good strategy, $$\mu_k < \mu^* \;\Rightarrow\; \liminf_{T \to \infty} \frac{\mathbb{E}_\mu[N_k(T)]}{\log T} \ge \frac{1}{d(\mu_k, \mu^*)},$$ where $d(p, p') = \mathrm{KL}(\mathcal{B}(p), \mathcal{B}(p')) = p\log\frac{p}{p'} + (1-p)\log\frac{1-p}{1-p'}$. Can we find asymptotically optimal algorithms, i.e. algorithms matching the lower bound?
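
As a small numerical illustration of the lower bound, here is a sketch (assumed Python helper, not from the slides) that computes the Bernoulli KL divergence $d(p, p')$ and the resulting Lai & Robbins constant for a suboptimal arm.

```python
import math

def kl_bernoulli(p, q):
    """d(p, q) = KL(B(p), B(q)) for Bernoulli distributions."""
    eps = 1e-12                                   # avoid log(0) at the boundary
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Lai & Robbins: a suboptimal arm with mean mu_k must be drawn roughly
# log(T) / d(mu_k, mu_star) times by any uniformly good strategy.
mu_star, mu_k, T = 0.6, 0.5, 10_000
print(math.log(T) / kl_bernoulli(mu_k, mu_star))   # rough lower bound on E[N_k(T)]
```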

27 Outline Sequential Resource Allocation Stochastic Multi-Armed Bandit Models Bandit Algorithms for Maximizing Rewards First Ideas UCB algorithms Bayesian Bandit Algorithms Beyond Simple Bandit Models Bandit algorithms for Optimal Exploration

28 Some (naive) strategies Idea 1: Draw each arm T/K times. EXPLORATION $$R(T) = \left(\frac{1}{K}\sum_{k=2}^{K} (\mu_1 - \mu_k)\right) T$$

29 Some (naive) strategies Idea 1: Draw each arm T/K times. EXPLORATION $$R(T) = \left(\frac{1}{K}\sum_{k=2}^{K} (\mu_1 - \mu_k)\right) T$$ Idea 2: Always trust the empirical best arm, $A_{t+1} = \operatorname{argmax}_{k \in \{1,\dots,K\}} \hat\mu_k(t)$, where $\hat\mu_k(t) = \frac{1}{N_k(t)}\sum_{s=1}^{t} R_s \mathbb{1}_{(A_s=k)}$ is an estimate of the unknown mean $\mu_k$. EXPLOITATION $$R(T) \ge (1-\mu_1)\,\mu_2\,(\mu_1 - \mu_2)\,T$$

30 A better idea: Explore-Then-Exploit Given $m \in \{1,\dots,\lfloor T/K\rfloor\}$, draw each arm m times, compute the empirical best arm $\hat{k} = \operatorname{argmax}_k \hat\mu_k(Km)$, keep playing this arm until round T: $A_{t+1} = \hat{k}$ for $t \ge Km$. EXPLORATION followed by EXPLOITATION

31 A better idea: Explore-Then-Exploit Given $m \in \{1,\dots,\lfloor T/K\rfloor\}$, draw each arm m times, compute the empirical best arm $\hat{k} = \operatorname{argmax}_k \hat\mu_k(Km)$, keep playing this arm until round T: $A_{t+1} = \hat{k}$ for $t \ge Km$. EXPLORATION followed by EXPLOITATION Analysis for two arms: $\mu_1 > \mu_2$, $\Delta := \mu_1 - \mu_2$. Assumption: the rewards are bounded in [0, 1]. $$R(T) \le \Delta m + \Delta T \exp\left(-\frac{m\Delta^2}{2}\right)$$ Proof: on the board, uses Hoeffding's inequality.

32 A better idea: Explore-Then-Exploit Given $m \in \{1,\dots,\lfloor T/K\rfloor\}$, draw each arm m times, compute the empirical best arm $\hat{k} = \operatorname{argmax}_k \hat\mu_k(Km)$, keep playing this arm until round T: $A_{t+1} = \hat{k}$ for $t \ge Km$. EXPLORATION followed by EXPLOITATION Analysis for two arms: $\mu_1 > \mu_2$, $\Delta := \mu_1 - \mu_2$. Assumption: the rewards are bounded in [0, 1]. $$R(T) \le \Delta m + \Delta T \exp\left(-\frac{m\Delta^2}{2}\right)$$ How to choose m?

33 A better idea: Explore-Then-Exploit Given $m \in \{1,\dots,\lfloor T/K\rfloor\}$, draw each arm m times, compute the empirical best arm $\hat{k} = \operatorname{argmax}_k \hat\mu_k(Km)$, keep playing this arm until round T: $A_{t+1} = \hat{k}$ for $t \ge Km$. EXPLORATION followed by EXPLOITATION Analysis for two arms: $\mu_1 > \mu_2$, $\Delta := \mu_1 - \mu_2$. Assumption: the rewards are bounded in [0, 1]. With $m = \frac{2}{\Delta^2}\log\left(\frac{T\Delta^2}{2}\right)$: $$R(T) \le \frac{2}{\Delta}\left[\log\left(\frac{T\Delta^2}{2}\right) + 1\right]$$

34 A better idea: Explore-Then-Exploit Given $m \in \{1,\dots,\lfloor T/K\rfloor\}$, draw each arm m times, compute the empirical best arm $\hat{k} = \operatorname{argmax}_k \hat\mu_k(Km)$, keep playing this arm until round T: $A_{t+1} = \hat{k}$ for $t \ge Km$. EXPLORATION followed by EXPLOITATION Analysis for two arms: $\mu_1 > \mu_2$, $\Delta := \mu_1 - \mu_2$. Assumption: the rewards are bounded in [0, 1]. With $m = \frac{2}{\Delta^2}\log\left(\frac{T\Delta^2}{2}\right)$: $$R(T) \le \frac{2}{\Delta}\left[\log\left(\frac{T\Delta^2}{2}\right) + 1\right]$$ Problem: requires the knowledge of T and $\Delta$.
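
A minimal Python sketch of the two-arm Explore-Then-Commit strategy above, assuming Bernoulli arms and the tuned value of m from the slide (which indeed requires knowing T and the gap); names like explore_then_commit and arm_sampler are illustrative.

```python
import numpy as np

def explore_then_commit(arm_sampler, T, m):
    """Two-arm Explore-Then-Commit sketch: draw each arm m times, then
    commit to the empirical best arm for the remaining rounds.
    arm_sampler(k) returns one random reward from arm k."""
    K = 2
    rewards = [[], []]
    total = 0.0
    t = 0
    # Exploration phase: m draws of each arm (interleaved).
    for _ in range(m):
        for k in range(K):
            r = arm_sampler(k)
            rewards[k].append(r)
            total += r
            t += 1
    # Commit to the empirical best arm until round T.
    best = int(np.argmax([np.mean(rewards[k]) for k in range(K)]))
    for _ in range(T - t):
        total += arm_sampler(best)
    return total

# Example with Bernoulli arms and m tuned as on the slide.
rng = np.random.default_rng(0)
mu = [0.6, 0.5]
delta = mu[0] - mu[1]
T = 10_000
m = int(np.ceil((2 / delta**2) * np.log(T * delta**2 / 2)))
print(explore_then_commit(lambda k: rng.binomial(1, mu[k]), T, m))
```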

35 Sequential Explore-Then-Exploit (2 arms) Explore uniformly until the random time $$\tau = \inf\left\{t \in \mathbb{N} : |\hat\mu_1(t) - \hat\mu_2(t)| > \sqrt{\frac{4\log(T^2)}{t}}\right\},$$ then set $\hat{k} = \operatorname{argmax}_k \hat\mu_k(\tau)$ and $A_{t+1} = \hat{k}$ for $t \in \{\tau,\dots,T\}$. $$R_T \le \frac{16}{\Delta}\log(T) + C(\Delta).$$ Logarithmic regret without knowing $\Delta$.

36 Sequential Explore-Then-Exploit (2 arms) Explore uniformly until the random time $$\tau = \inf\left\{t \in \mathbb{N} : |\hat\mu_1(t) - \hat\mu_2(t)| > \sqrt{\frac{4\log(T/t)}{t}}\right\},$$ then set $\hat{k} = \operatorname{argmax}_k \hat\mu_k(\tau)$ and $A_{t+1} = \hat{k}$ for $t \in \{\tau,\dots,T\}$. $$R_T \le \frac{2}{\Delta}\log(T) + C\sqrt{\log(T)}.$$ Same regret rate, without knowing $\Delta$. [Garivier et al., On Explore-Then-Commit Strategies, 2016]

37 Sequential Explore-Then-Exploit (2 arms) Explore uniformly until the random time $$\tau = \inf\left\{t \in \mathbb{N} : |\hat\mu_1(t) - \hat\mu_2(t)| > \sqrt{\frac{4\log(T/t)}{t}}\right\},$$ then set $\hat{k} = \operatorname{argmax}_k \hat\mu_k(\tau)$ and $A_{t+1} = \hat{k}$ for $t \in \{\tau,\dots,T\}$. $$R_T \le \frac{2}{\Delta}\log(T) + C\sqrt{\log(T)}.$$ Still requires the knowledge of T... [Garivier et al., On Explore-Then-Commit Strategies, 2016]

38 Outline Sequential Resource Allocation Stochastic Multi-Armed Bandit Models Bandit Algorithms for Maximizing Rewards First Ideas UCB algorithms Bayesian Bandit Algorithms Beyond Simple Bandit Models Bandit algorithms for Optimal Exploration

39 The optimism principle For each arm k, build a confidence interval on the mean $\mu_k$: $\mathcal{I}_k(t) = [\mathrm{LCB}_k(t), \mathrm{UCB}_k(t)]$, where LCB = Lower Confidence Bound and UCB = Upper Confidence Bound. Figure: Confidence intervals on the means after t rounds.

40 The optimism principle We apply the following principle: act as if the best possible model were the true model (optimism in the face of uncertainty). Figure: Confidence intervals on the means after t rounds. Thus, one selects at time t + 1 the arm $A_{t+1} = \operatorname{argmax}_{k=1,\dots,K} \mathrm{UCB}_k(t)$. [Lai and Robbins 1985] [Agrawal 1995]

41 How to build confidence intervals? We need $\mathrm{UCB}_a(t)$ such that $P(\mu_a \le \mathrm{UCB}_a(t)) \ge 1 - \frac{1}{t}$: use concentration inequalities. Assumption: the rewards are bounded in [0, 1]. Hoeffding's inequality: $Z_i$ i.i.d. with mean $\mu$ and $Z_i \in [0,1]$. For all $s \ge 1$, $$P\left(\frac{Z_1 + \dots + Z_s}{s} \ge \mu + x\right) \le e^{-2x^2 s}.$$ We can show that $$P\left(\mu_a \le \hat\mu_a(t) + \sqrt{\frac{\alpha\log(t)}{N_a(t)}}\right) \ge 1 - \frac{1}{t^{2\alpha-1}}.$$

42 How to build confidence intervals? We need $\mathrm{UCB}_a(t)$ such that $P(\mu_a \le \mathrm{UCB}_a(t)) \ge 1 - \frac{1}{t}$: use concentration inequalities. Assumption: the rewards are bounded in [0, 1]. Hoeffding's inequality: $Z_i$ i.i.d. with mean $\mu$ and $Z_i \in [0,1]$. For all $s \ge 1$, $$P\left(\frac{Z_1 + \dots + Z_s}{s} \le \mu - x\right) \le e^{-2x^2 s}.$$ We can show that $$P\left(\mu_a \le \hat\mu_a(t) + \sqrt{\frac{\alpha\log(t)}{N_a(t)}}\right) \ge 1 - \frac{1}{t^{2\alpha-1}}.$$

43 A UCB algorithm UCB(α) selects $A_{t+1} = \operatorname{argmax}_a \mathrm{UCB}_a(t)$ where $$\mathrm{UCB}_a(t) = \underbrace{\hat\mu_a(t)}_{\text{exploitation term}} + \underbrace{\sqrt{\frac{\alpha\log(t)}{N_a(t)}}}_{\text{exploration bonus}}.$$ Popularised by [Auer et al. 02]: UCB1, α = 2.
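
A minimal Python sketch of UCB(α) on Bernoulli arms, to show how the exploitation term and exploration bonus combine; the function name ucb and the synthetic arms are illustrative assumptions, not from the slides.

```python
import numpy as np

def ucb(arm_sampler, K, T, alpha=2.0):
    """UCB(alpha) sketch for rewards in [0, 1]: play each arm once, then pull
    the arm maximizing  hat_mu_a(t) + sqrt(alpha * log(t) / N_a(t))."""
    counts = np.zeros(K)        # N_a(t)
    sums = np.zeros(K)          # cumulative reward of each arm
    total = 0.0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                               # initialization: each arm once
        else:
            means = sums / counts
            bonus = np.sqrt(alpha * np.log(t) / counts)
            a = int(np.argmax(means + bonus))       # optimism in the face of uncertainty
        r = arm_sampler(a)
        counts[a] += 1
        sums[a] += r
        total += r
    return total

# Example with three Bernoulli arms.
rng = np.random.default_rng(1)
mu = [0.3, 0.5, 0.45]
print(ucb(lambda a: rng.binomial(1, mu[a]), K=3, T=10_000))
```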

44 UCB in action

45 Regret analysis of a UCB algorithm Theorem The UCB(α) algorithm satisfies, for α = 3/2, $$\mathbb{E}[N_a(T)] \le \frac{6}{(\mu_1 - \mu_a)^2}\log(T) + \frac{\pi^2}{3}.$$ 2 arms: $R(T) \le \frac{6}{\Delta}\log(T) + C$. Proof: on the board.

46 An improved analysis of UCB(α) Theorem [Bubeck 11], [Cappé et al. 13] For α > 1/2, the UCB(α) algorithm satisfies $$\mathbb{E}[N_a(T)] \le \frac{\alpha}{(\mu_1 - \mu_a)^2}\log(T) + O(\sqrt{\log(T)}).$$ 2 arms: $R(T) \le \frac{1}{2\Delta}\log(T) + C\sqrt{\log(T)}$. UCB combines exploration and exploitation and has a smaller regret than Explore-Then-Exploit approaches; it is order-optimal for Bernoulli distributions [Pinsker's inequality: $d(\mu_a, \mu_1) \ge 2(\mu_1 - \mu_a)^2$].

47 The kl-UCB algorithm (for Bernoulli bandits, or other simple parametric families) A UCB-type algorithm, $A_{t+1} = \operatorname{argmax}_k u_k(t)$, associated to the right upper confidence bound: $$u_k(t) = \max\left\{q : d(\hat\mu_k(t), q) \le \frac{\log(t)}{N_k(t)}\right\}, \qquad d(x,y) = \mathrm{KL}(\mathcal{B}(x), \mathcal{B}(y)).$$ [Figure: the map $q \mapsto d(\hat\mu_a(t), q)$ crosses the level $\log(t)/N_a(t)$ at $q = u_a(t)$, to the right of $\hat\mu_a(t)$.] [Cappé et al. 13]: $$\mathbb{E}_\mu[N_k(T)] \le \frac{1}{d(\mu_k, \mu^*)}\log T + O(\sqrt{\log(T)}).$$

48 The kl-UCB algorithm (for Bernoulli bandits, or other simple parametric families) A UCB-type algorithm, $A_{t+1} = \operatorname{argmax}_k u_k(t)$, associated to the right upper confidence bound: $$u_k(t) = \max\left\{q : d(\hat\mu_k(t), q) \le \frac{\log(t)}{N_k(t)}\right\}, \qquad d(x,y) = \mathrm{KL}(\mathcal{B}(x), \mathcal{B}(y)).$$ [Figure: the map $q \mapsto d(\hat\mu_a(t), q)$ crosses the level $\log(t)/N_a(t)$ at $q = u_a(t)$, to the right of $\hat\mu_a(t)$.] kl-UCB is asymptotically optimal for Bernoulli bandits!
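
The index $u_k(t)$ has no closed form, but since $q \mapsto d(\hat\mu_k(t), q)$ is increasing on $[\hat\mu_k(t), 1]$ it can be computed by bisection. A small Python sketch (helper names kl_bernoulli and klucb_index are illustrative):

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """d(p, q) = KL(B(p), B(q)), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def klucb_index(mu_hat, n_pulls, t, tol=1e-6):
    """u_k(t) = max{q : d(mu_hat, q) <= log(t)/N_k(t)}, by bisection."""
    level = math.log(t) / n_pulls
    low, high = mu_hat, 1.0
    while high - low > tol:
        mid = (low + high) / 2
        if kl_bernoulli(mu_hat, mid) <= level:
            low = mid                      # mid still satisfies the constraint
        else:
            high = mid
    return low

# Example: an arm with empirical mean 0.4 observed 50 times, at round t = 1000.
print(klucb_index(0.4, 50, 1000))
```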

49 Outline Sequential Resource Allocation Stochastic Multi-Armed Bandit Models Bandit Algorithms for Maximizing Rewards First Ideas UCB algorithms Bayesian Bandit Algorithms Beyond Simple Bandit Models Bandit algorithms for Optimal Exploration

50 The Bayesian choice Bernoulli bandit model $\mu = (\mu_1,\dots,\mu_K)$. Frequentist view: $\mu_1,\dots,\mu_K$ are unknown parameters. Tools: estimators, confidence intervals.

51 The Bayesian choice Bernoulli bandit model $\mu = (\mu_1,\dots,\mu_K)$. Bayesian view: $\mu_1,\dots,\mu_K$ are random variables. Prior distribution: $\mu_a \sim \mathcal{U}([0,1])$. Tool: the posterior distribution $$\pi_k(t) = \mathcal{L}(\mu_k \mid R_1,\dots,R_t) = \mathrm{Beta}(S_k(t) + 1,\, N_k(t) - S_k(t) + 1),$$ where $S_k(t) = \sum_{s=1}^{t} R_s \mathbb{1}_{(A_s=k)}$ is the sum of the rewards from arm k. [Figure: posterior update $\pi_a(t) \to \pi_a(t+1)$, if $X_{t+1} = 0$ and if $X_{t+1} = 1$.]

52 Bayesian algorithm A Bayesian bandit algorithm exploits the posterior distributions of the means to decide which arm to select

53 The Bayes-UCB algorithm Let $\pi_k(t)$ be the posterior distribution over $\mu_k$ at the end of round t. Bayes-UCB [K., Cappé, Garivier 2012] selects $$A_{t+1} = \operatorname{argmax}_{k \in \{1,\dots,K\}} Q\left(1 - \frac{1}{t}, \pi_k(t)\right),$$ where $Q(\alpha, \nu)$ is the quantile of order α of the distribution ν: $P_{X \sim \nu}(X \le Q(\alpha,\nu)) = \alpha$. Properties: easy to implement (quantiles of Beta distributions); also asymptotically optimal for Bernoulli bandits! (the quantile $q_k(t) = Q(1 - \frac{1}{t}, \pi_k(t))$ is close to the kl-UCB index $u_k(t)$); efficient in practice and easy to generalize.
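
A minimal Python sketch of Bayes-UCB for Bernoulli arms with uniform priors, using SciPy's Beta quantile function; the name bayes_ucb and the synthetic arms are illustrative assumptions.

```python
import numpy as np
from scipy.stats import beta

def bayes_ucb(arm_sampler, K, T):
    """Bayes-UCB sketch: select the arm whose Beta(S_k+1, N_k-S_k+1)
    posterior has the largest quantile of order 1 - 1/t."""
    successes = np.zeros(K)     # S_k(t)
    counts = np.zeros(K)        # N_k(t)
    total = 0.0
    for t in range(1, T + 1):
        level = 1.0 - 1.0 / t
        q = beta.ppf(level, successes + 1, counts - successes + 1)  # posterior quantiles
        a = int(np.argmax(q))
        r = arm_sampler(a)
        successes[a] += r
        counts[a] += 1
        total += r
    return total

rng = np.random.default_rng(2)
mu = [0.4, 0.5, 0.55]
print(bayes_ucb(lambda a: rng.binomial(1, mu[a]), K=3, T=5_000))
```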

54 Bayes-UCB in practice

55 Thompson Sampling $$\forall a \in \{1,\dots,K\},\ \theta_a(t) \sim \pi_a(t), \qquad A_{t+1} = \operatorname{argmax}_{a=1,\dots,K} \theta_a(t).$$ Figure: TS selects arm 2 as $\theta_2(t) > \theta_1(t)$. The first bandit algorithm! [Thompson 1933] Very efficient, beyond Bernoulli bandits. Matches the Lai and Robbins bound for Bernoulli bandits [K., Korda and Munos, 2012], [Agrawal and Goyal 2013].
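
For Bernoulli arms with uniform priors, one posterior sample per arm is just a Beta draw, so the whole algorithm fits in a few lines. A minimal Python sketch (thompson_sampling and the synthetic arms are illustrative names, not from the slides):

```python
import numpy as np

def thompson_sampling(arm_sampler, K, T, rng=np.random.default_rng(0)):
    """Thompson Sampling sketch for Bernoulli arms with uniform priors:
    sample theta_a(t) ~ Beta(S_a + 1, N_a - S_a + 1), play the argmax."""
    successes = np.zeros(K)
    failures = np.zeros(K)
    total = 0.0
    for _ in range(T):
        theta = rng.beta(successes + 1, failures + 1)   # one posterior sample per arm
        a = int(np.argmax(theta))
        r = arm_sampler(a)
        successes[a] += r
        failures[a] += 1 - r
        total += r
    return total

rng = np.random.default_rng(3)
mu = [0.4, 0.5, 0.55]
print(thompson_sampling(lambda a: rng.binomial(1, mu[a]), K=3, T=5_000))
```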

56 Thompson Sampling, a General Principle Draw a possible model from your posterior distribution and act optimally in this sampled model

57 Outline Sequential Resource Allocation Stochastic Multi-Armed Bandit Models Bandit Algorithms for Maximizing Rewards First Ideas UCB algorithms Bayesian Bandit Algorithms Beyond Simple Bandit Models Bandit algorithms for Optimal Exploration

58 Contextual Bandits Example: movie recommendation. What movie should Netflix recommend to a particular user, given the ratings provided by previous users? → To make good recommendations, we should take into account the characteristics of the movies / users. Contextual bandit problem: at time t, a context $x_t$ is observed, an arm $A_t$ is chosen, and a reward $R_t$ that depends on $x_t, A_t$ is received.

59 Contextual Bandits Example: movie recommendation. What movie should Netflix recommend to a particular user, given the ratings provided by previous users? → To make good recommendations, we should take into account the characteristics of the movies / users. Example: linear model $R_t = \langle x_{t,A_t}, \theta\rangle + \epsilon_t$, where $x_{a,t} \in \mathbb{R}^d$ is a feature vector built for user t and movie a, and $\theta \in \mathbb{R}^d$ is an unknown parameter.

60 Contextual Bandits Example: movie recommendation. What movie should Netflix recommend to a particular user, given the ratings provided by previous users? → To make good recommendations, we should take into account the characteristics of the movies / users. Example: logistic model $P(R_t = 1 \mid A_t = a) = \frac{1}{1 + e^{-x_{t,a}^T\theta}}$, where $x_{a,t} \in \mathbb{R}^d$ is a feature vector built for user t and movie a, and $\theta \in \mathbb{R}^d$ is an unknown parameter.

61 Tools for (linear/logistic) contextual bandits In essence, the tools are the same: Explore-Then-Exploit [Rusmevichientong and Tsitsiklis, Linearly Parameterized Bandits, 2010]; Upper Confidence Bounds: build a confidence region for the regression parameter $\theta \in \mathbb{R}^d$ (Lin-UCB) [Abbasi-Yadkori et al., Improved algorithms for linear bandits, 2011], [Filippi et al., Generalized Linear Bandits, 2010]; Thompson Sampling: Bayesian model with a prior distribution on the regression parameter $\theta \in \mathbb{R}^d$ [Agrawal and Goyal, Thompson Sampling for Contextual Bandits with linear payoffs, 2013].

62 Example: Thompson Sampling for Linear Bandits Bayesian model: $R_t = x_{t,A_t}^T\theta + \epsilon_t$, with $\theta \sim \mathcal{N}(0, \kappa^2 I_d)$ and $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$. Explicit posterior distribution: $$p(\theta \mid x_{1,A_1}, R_1, \dots, x_{t,A_t}, R_t) = \mathcal{N}\left(\hat\theta(t), \Sigma_t\right)$$ (multivariate Gaussian distribution). Thompson Sampling: $$\tilde\theta(t) \sim \mathcal{N}\left(\hat\theta(t), \Sigma_t\right), \qquad A_{t+1} = \operatorname{argmax}_{a=1,\dots,K} x_{a,t+1}^T \tilde\theta(t).$$
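
A compact Python sketch of this Gaussian posterior recursion (Bayesian linear regression) followed by posterior sampling. It assumes fixed arm features and synthetic data; the names linear_thompson_sampling, contexts and theta_true are illustrative, and sigma, kappa play the roles of the noise and prior scales above.

```python
import numpy as np

def linear_thompson_sampling(contexts, theta_true, T, sigma=0.5, kappa=1.0,
                             rng=np.random.default_rng(0)):
    """Thompson Sampling sketch for a linear bandit with Gaussian prior/noise.
    contexts: array of shape (K, d), one feature vector per arm (kept fixed
    here for simplicity; in the contextual setting it changes every round)."""
    K, d = contexts.shape
    B = np.eye(d) / kappa**2          # posterior precision matrix
    f = np.zeros(d)                   # sum of x_s * R_s / sigma^2
    total = 0.0
    for _ in range(T):
        cov = np.linalg.inv(B)
        theta_hat = cov @ f                                      # posterior mean
        theta_tilde = rng.multivariate_normal(theta_hat, cov)    # posterior sample
        a = int(np.argmax(contexts @ theta_tilde))
        x = contexts[a]
        r = x @ theta_true + sigma * rng.standard_normal()       # observe reward
        B += np.outer(x, x) / sigma**2
        f += x * r / sigma**2
        total += r
    return total

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 3))       # 5 arms, 3-dimensional features
print(linear_thompson_sampling(X, theta_true=np.array([1.0, -0.5, 0.2]), T=2_000))
```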

63 Reinforcement Learning A more general setup in which one repeatedly takes decisions that generate rewards and an evolution of some state, e.g. management of an electric grid. Formalization: an agent in a state $s_t$ chooses an action $a_t$, receives a reward $r_{t+1}$ and ends up in a new state $s_{t+1}$. Modelling tool: Markov Decision Process (MDP).

64 Reinforcement Learning A more general setup in which one repeatedly takes decisions that generate rewards and an evolution of some state, e.g. management of an electric grid. Formalization: an agent in a state $s_t$ chooses an action $a_t$, receives a reward $r_{t+1}$ and ends up in a new state $s_{t+1}$. Reinforcement Learning = how to solve an unknown MDP?

65 Bandit Tools for building AI for Games To decide the next move to play: successively choose trajectories in the game tree and perform (random) evaluations of certain positions. How to sequentially select trajectories in the game tree? The UCT algorithm [Kocsis & Szepesvari 06]: UCB for Trees!

66 Outline Sequential Resource Allocation Stochastic Multi-Armed Bandit Models Bandit Algorithms for Maximizing Rewards First Ideas UCB algorithms Bayesian Bandit Algorithms Beyond Simple Bandit Models Bandit algorithms for Optimal Exploration

67 A motivation: A/B Testing Which version of a webpage generates more revenue? For the t-th visitor: display version $A_t \in \{1, 2,\dots, K\}$, observe $R_t = 1$ if a conversion occurs. This is a bandit problem, but the company does not want to alternate forever between different versions...

68 A motivation: A/B Testing Which version of a webpage generates more revenue? For the t-th visitor: display version $A_t \in \{1, 2,\dots, K\}$, observe $R_t = 1$ if a conversion occurs, possibly stop the test and identify the best version.

69 A pure-exploration problem The agent has to identify the arm with the highest mean, $a^*$ (no loss when drawing bad arms). The agent uses a sampling strategy $(A_t)$, stops at some (random) time $\tau$, and upon stopping recommends an arm $\hat{a}_\tau$. Goal: given a risk parameter δ, identify the best version with high probability, $P(\hat{a}_\tau = a^*) \ge 1 - \delta$, using as few samples as possible (short testing phase) [Even-Dar et al. 06]: $\mathbb{E}[\tau]$ is small.

70 An algorithm based on confidence intervals LUCB [Kalyanakrishnan et al. 12] relies on upper and lower confidence bounds. For KL-LUCB: $$u_a(t) = \max\{q : N_a(t)\, d(\hat\mu_a(t), q) \le \log(ct/\delta)\}, \qquad \ell_a(t) = \min\{q : N_a(t)\, d(\hat\mu_a(t), q) \le \log(ct/\delta)\}.$$ Sampling rule: draw both $B_{t+1} = \operatorname{argmax}_a \hat\mu_a(t)$ and $C_{t+1} = \operatorname{argmax}_{b \ne B_{t+1}} u_b(t)$. Stopping rule: $\tau = \inf\{t \in \mathbb{N} : \ell_{B_t}(t) > u_{C_t}(t)\}$.
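
To illustrate the structure (empirical leader vs. best challenger, stop when their confidence intervals separate), here is a minimal Python sketch of LUCB. It uses Hoeffding-style confidence bounds with an assumed exploration rate, whereas the slide's KL-LUCB uses KL-based bounds; the names lucb and arm_sampler are illustrative.

```python
import numpy as np

def lucb(arm_sampler, K, delta):
    """LUCB sketch with Hoeffding-style confidence bounds.
    Returns the recommended arm and the number of samples used."""
    counts = np.zeros(K)
    sums = np.zeros(K)

    def draw(a):
        sums[a] += arm_sampler(a)
        counts[a] += 1

    for a in range(K):                       # initialization: one draw per arm
        draw(a)
    t = K
    while True:
        means = sums / counts
        # Assumed exploration rate log(4 K t^2 / delta); KL-LUCB uses log(ct/delta).
        width = np.sqrt(np.log(4 * K * t**2 / delta) / (2 * counts))
        ucb, lcb = means + width, means - width
        B = int(np.argmax(means))            # empirical best arm
        others = [a for a in range(K) if a != B]
        C = others[int(np.argmax(ucb[others]))]   # best challenger (largest UCB)
        if lcb[B] > ucb[C]:                  # stopping rule
            return B, t
        draw(B)
        draw(C)
        t += 2

rng = np.random.default_rng(5)
mu = [0.5, 0.45, 0.4]
print(lucb(lambda a: rng.binomial(1, mu[a]), K=3, delta=0.05))
```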

71 KL-LUCB in action

72 Theoretical guarantees Assume $\mu_1 > \mu_2 \ge \dots \ge \mu_K$. Theorem [Kalyanakrishnan et al.] For well-chosen confidence intervals, LUCB is (δ)-correct and $$\mathbb{E}[\tau_\delta] = O\left(\left[\frac{1}{\Delta_2^2} + \sum_{a=2}^{K} \frac{1}{\Delta_a^2}\right]\log\frac{1}{\delta}\right),$$ where $\Delta_a = \mu_1 - \mu_a$. Remark: KL-UCB versus KL-LUCB behave quite differently.

73 Theoretical guarantees Assume $\mu_1 > \mu_2 \ge \dots \ge \mu_K$. Theorem [Kalyanakrishnan et al.] For well-chosen confidence intervals, LUCB is (δ)-correct and $$\mathbb{E}[\tau_\delta] = O\left(\left[\frac{1}{\Delta_2^2} + \sum_{a=2}^{K} \frac{1}{\Delta_a^2}\right]\log\frac{1}{\delta}\right),$$ where $\Delta_a = \mu_1 - \mu_a$. Remark: KL-UCB versus KL-LUCB behave quite differently; for further comments on this and on optimal best arm identification algorithms, see [Kaufmann and Garivier, Finding the arm with largest mean: Two bandit frameworks, 2017].

74 Reference The Bandit Book (follow the link), Tor Lattimore and Csaba Szepesvari, to be released in 2019
