Alireza Shafaei. Machine Learning Reading Group, The University of British Columbia, Summer 2017

2 Outline: Online Convex Optimization (OCO), Bandit Convex Optimization (BCO), Multi-Armed Bandit (MAB), Stochastic MAB, Bernoulli MAB.

3 Online Convex Optimization (OCO). At each iteration $t$, the player chooses $x_t \in K$. A convex loss function $f_t \in F$, with $f_t : K \to \mathbb{R}$, is then revealed, and a cost $f_t(x_t)$ is incurred. $F$ is a set of bounded functions. $f_t$ is revealed only after $x_t$ is chosen, and it can be chosen adversarially.

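4 Regret. Given an algorithm $A$, the regret of $A$ after $T$ iterations is defined as $\mathrm{regret}_T(A) := \sup_{\{f_1, \dots, f_T\} \subseteq F} \left\{ \sum_{t=1}^{T} f_t(x_t) - \min_{x \in K} \sum_{t=1}^{T} f_t(x) \right\}$. Online Gradient Descent achieves regret $O(GD\sqrt{T})$, where $G$ bounds the gradient norms and $D$ is the diameter of $K$. If every $f_t$ is $\alpha$-strongly convex, the bound improves to $O\left(\frac{G^2}{2\alpha}(1 + \log T)\right)$.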
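
As a rough illustration of the $O(GD\sqrt{T})$ guarantee, the sketch below runs online gradient descent on a toy one-dimensional problem. Everything specific in it is an assumption made for the example and does not come from the slides: the set $K = [-1, 1]$, the quadratic losses $f_t(x) = (x - z_t)^2$, the step size $\eta_t = D/(G\sqrt{t})$, and all function names. It measures regret against one particular loss sequence, not the supremum in the definition.

```python
import math
import random

def project(x, lo=-1.0, hi=1.0):
    """Euclidean projection onto the interval K = [lo, hi]."""
    return max(lo, min(hi, x))

def online_gradient_descent(targets, G=4.0, D=2.0):
    """Run OGD on the losses f_t(x) = (x - z_t)^2 and return the points played.

    G = 4 bounds |f_t'(x)| on K = [-1, 1]; D = 2 is the diameter of K.
    """
    x = 0.0
    played = []
    for t, z in enumerate(targets, start=1):
        played.append(x)              # commit to x_t before f_t is revealed
        grad = 2.0 * (x - z)          # gradient of f_t at x_t
        eta = D / (G * math.sqrt(t))  # decaying step size (assumed schedule)
        x = project(x - eta * grad)   # gradient step, then project back onto K
    return played

def regret(played, targets):
    """Player's cumulative loss minus that of the best fixed point in K."""
    best = project(sum(targets) / len(targets))  # minimiser of the summed quadratics
    player_loss = sum((x - z) ** 2 for x, z in zip(played, targets))
    best_loss = sum((best - z) ** 2 for z in targets)
    return player_loss - best_loss

rng = random.Random(0)
targets = [rng.uniform(-1.0, 1.0) for _ in range(1000)]  # hypothetical loss sequence
played = online_gradient_descent(targets)
print(regret(played, targets))
```

The losses here are drawn at random only for convenience; the guarantee itself also covers adversarially chosen sequences.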

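5 Bandit Convex Optimization (BCO). In OCO we had access to the full loss function $f_t$, and hence to its gradient $\nabla f_t(x_t)$; in BCO we only observe the value $f_t(x_t)$. The multi-armed bandit (MAB) problem further constrains the BCO setting. Material from [1].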

6 Bandit Convex Optimization. In data networks, the decision maker can measure the round-trip delay (RTD) of a packet, but rarely has access to the congestion pattern of the entire network. In ad placement, the search engine can inspect which ads were clicked through, but cannot know whether different ads, had they been chosen, would have been clicked through. Given a fixed budget, how should resources be allocated among research projects whose outcomes are only partially known at the time of allocation and may change over time? From Wikipedia: "Originally considered by Allied scientists in World War II, it proved so intractable that, according to Peter Whittle, the problem was proposed to be dropped over Germany so that German scientists could also waste their time on it."

7 Multi-Armed Bandit (MAB). On each iteration $t$, the player chooses an action $i_t$ from a predefined set of discrete actions $\{1, \dots, n\}$. An adversary, independently, chooses a loss in $[0, 1]$ for each action. The loss associated with $i_t$ is then revealed to the player. There are many MAB specifications with different assumptions and constraints. This definition is similar to the multi-expert problem, except that we do not observe the losses associated with the other experts.

8 MAB as a BCO. The algorithms usually choose an action according to a distribution over the actions. If we define $K = \Delta_n$, the $n$-dimensional simplex, then $f_t(x) = \ell_t^\top x = \sum_{i=1}^{n} \ell_t(i)\, x(i)$ for all $x \in K$. We have an exploration-exploitation trade-off. A simple approach would be the following. Exploration: with some probability, explore by choosing actions uniformly at random, and construct an estimate of the actions' losses from the feedback. Exploitation: otherwise, use the estimates to make a decision.

9 An algorithm can be constructed from these two steps.
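
The algorithm listing from this slide did not survive, so the following is only a minimal sketch of one way to combine the exploration and exploitation steps, assuming online gradient descent on the simplex as the underlying learner. The names `project_to_simplex` and `simple_bandit`, the parameters `delta` (exploration probability) and `eta` (step size), and the toy loss sequence are all hypothetical choices for illustration.

```python
import random

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        if u_k - candidate > 0:      # holds for a prefix; keep the last such k
            theta = candidate
    return [max(x - theta, 0.0) for x in v]

def simple_bandit(losses, delta, eta, rng):
    """losses[t][i] is the adversary's loss in [0, 1] for action i at round t."""
    n = len(losses[0])
    x = [1.0 / n] * n                # start from the uniform distribution
    total_loss = 0.0
    for loss_t in losses:
        if rng.random() < delta:     # exploration round
            i_t = rng.randrange(n)   # pick an action uniformly at random
            total_loss += loss_t[i_t]
            estimate = [0.0] * n
            estimate[i_t] = (n / delta) * loss_t[i_t]  # importance-weighted estimate
            # Gradient step on the linear surrogate x -> estimate . x
            x = project_to_simplex([x_i - eta * g for x_i, g in zip(x, estimate)])
        else:                        # exploitation round
            i_t = rng.choices(range(n), weights=x)[0]  # sample from current x
            total_loss += loss_t[i_t]
            # The estimate is zero on these rounds, so x is left unchanged.
    return x, total_loss

# Toy adversary (assumed): action 0 always costs 0.1, the others 0.9.
losses = [[0.1, 0.9, 0.9] for _ in range(5000)]
final_x, total = simple_bandit(losses, delta=0.1, eta=0.01, rng=random.Random(0))
print(final_x, total)
```

Only the loss of the action actually played is ever read, which is what distinguishes this from the full-information setting.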

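10 Analysis. This algorithm guarantees $E\left[\sum_{t=1}^{T} \ell_t(i_t) - \min_i \sum_{t=1}^{T} \ell_t(i)\right] \le O(T^{3/4}\sqrt{n})$. Here $b_t = 1$ marks an exploration round, which occurs with probability $\delta$, and on such a round each action is chosen with probability $1/n$. The loss estimate is therefore unbiased: $E[\hat{\ell}_t(i)] = P[b_t = 1] \cdot P[i_t = i \mid b_t = 1] \cdot \frac{n}{\delta}\,\ell_t(i) = \ell_t(i)$, and so $E[\hat{f}_t] = f_t$. Its norm is bounded: $\|\hat{\ell}_t\|_2 \le \frac{n}{\delta}\,|\ell_t(i_t)| \le \frac{n}{\delta}$.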

11 Analysis

12 Has a worst-case near-optimal regret bound of $O(\sqrt{Tn \log n})$. See p. 104 of the OCO book for the proof.
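
The name of the algorithm on this slide is missing from the transcript. One well-known method with an $O(\sqrt{Tn \log n})$ regret bound is EXP3, and the sketch below assumes that is the kind of algorithm meant here; treat it as an illustration under that assumption, not as the slide's own listing. The learning rate $\varepsilon = \sqrt{\log n / (Tn)}$, the function name `exp3`, and the toy losses are likewise assumptions.

```python
import math
import random

def exp3(losses, rng):
    """Exponential-weights bandit sketch; losses[t][i] lies in [0, 1]."""
    T, n = len(losses), len(losses[0])
    eps = math.sqrt(math.log(n) / (T * n))  # assumed learning rate, needs T up front
    x = [1.0 / n] * n                       # distribution over actions
    total_loss = 0.0
    for loss_t in losses:
        i_t = rng.choices(range(n), weights=x)[0]  # sample an action from x
        total_loss += loss_t[i_t]
        estimate = loss_t[i_t] / x[i_t]     # importance-weighted loss estimate
        x[i_t] *= math.exp(-eps * estimate) # only the played action is updated
        norm = sum(x)
        x = [w / norm for w in x]           # renormalise to a distribution
    return x, total_loss

# Toy adversary (assumed): action 0 always costs 0.1, the others 0.9.
losses = [[0.1, 0.9, 0.9] for _ in range(5000)]
final_x, total = exp3(losses, random.Random(0))
print(final_x, total)
```

Unlike a scheme with separate exploration rounds, every round here both plays from the current distribution and updates it.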

13 Stochastic Multi-Armed Bandit. On each iteration $t$, the player chooses an action $i_t$ from a predefined set of discrete actions $\{1, \dots, n\}$. Each action $i$ has an underlying (fixed) probability distribution $P_i$ with mean $\mu_i$. The loss associated with $i_t$ is then revealed to the player (a sample is drawn from $P_{i_t}$). Each $P_i$ could be a simple Bernoulli distribution. A more complex version could assume a Markov process for each action, in which the state of one or all processes changes after each iteration. We still have the exploration-exploitation trade-off.

14 Bernoulli Multi-Armed Bandit. $N[a]$: the number of times arm $a$ is pulled. $Q[a]$: the running average of rewards for arm $a$. $S[a]$: the number of successes for arm $a$. $F[a]$: the number of failures for arm $a$. The same notion of regret applies, except that the optimal strategy is to pull the arm with the largest mean $\mu$. Material from [2].
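
The four quantities above can be maintained incrementally after every pull. The following is a minimal sketch of that bookkeeping; the class name `ArmStats` and its `update` method are hypothetical names chosen for the example.

```python
class ArmStats:
    """Per-arm statistics for a Bernoulli bandit with 0/1 rewards."""

    def __init__(self, n_arms):
        self.N = [0] * n_arms    # number of pulls of each arm
        self.Q = [0.0] * n_arms  # running average reward of each arm
        self.S = [0] * n_arms    # number of successes (reward 1)
        self.F = [0] * n_arms    # number of failures (reward 0)

    def update(self, a, reward):
        """Record a 0/1 reward observed after pulling arm a."""
        self.N[a] += 1
        # Incremental mean: avoids storing the full reward history.
        self.Q[a] += (reward - self.Q[a]) / self.N[a]
        if reward:
            self.S[a] += 1
        else:
            self.F[a] += 1

stats = ArmStats(2)
for r in (1, 0, 1, 1):
    stats.update(0, r)
print(stats.N, stats.Q, stats.S, stats.F)
```

Note that $Q[a] = S[a] / N[a]$ whenever $N[a] > 0$, so the running average is redundant for Bernoulli rewards but convenient for strategies that only need the mean.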

15 Regret of common selection strategies: Random Selection $O(T)$. Greedy Selection $O(T)$. $\epsilon$-greedy Selection $O(T)$. Boltzmann Exploration $O(T)$. Upper-Confidence Bound $O(\ln T)$. Thompson Sampling $O(\ln T)$.
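
To make the two logarithmic-regret strategies concrete, here is a hedged sketch of an upper-confidence-bound rule and of Thompson sampling on a Bernoulli bandit. It assumes the common UCB1 index $Q[a] + \sqrt{2 \ln t / N[a]}$ and a uniform Beta(1, 1) prior for Thompson sampling; the slides do not specify either choice. The arm means, the horizon, and the function names are also assumptions, and the regret reported is computed from the true means, which are known only in simulation.

```python
import math
import random

def ucb1(N, Q, t):
    """Pick the arm with the largest upper confidence bound; untried arms first."""
    for a, n_a in enumerate(N):
        if n_a == 0:
            return a
    return max(range(len(N)),
               key=lambda a: Q[a] + math.sqrt(2.0 * math.log(t) / N[a]))

def thompson(S, F, rng):
    """Sample a mean for each arm from Beta(S+1, F+1) and pick the largest."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(S, F)]
    return max(range(len(samples)), key=samples.__getitem__)

def run(policy, means, T, seed=0):
    """Simulate T pulls; return (regret w.r.t. the best mean, pull counts)."""
    rng = random.Random(seed)
    n = len(means)
    N, Q, S, F = [0] * n, [0.0] * n, [0] * n, [0] * n
    regret = 0.0
    for t in range(1, T + 1):
        a = ucb1(N, Q, t) if policy == "ucb1" else thompson(S, F, rng)
        reward = 1 if rng.random() < means[a] else 0  # Bernoulli reward
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]
        S[a] += reward
        F[a] += 1 - reward
        regret += max(means) - means[a]  # gap between best arm and chosen arm
    return regret, N

means = [0.2, 0.5, 0.8]  # hypothetical arm means
regret_ucb, pulls_ucb = run("ucb1", means, T=2000)
regret_ts, pulls_ts = run("thompson", means, T=2000)
print(regret_ucb, pulls_ucb)
print(regret_ts, pulls_ts)
```

Both rules concentrate their pulls on the best arm as evidence accumulates, which is the behaviour behind the $O(\ln T)$ entries in the list.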

16 Empirical Evaluation

17 There are scenarios in which we have access to more information. The extra information can be encoded as a context vector. In online advertising, for instance, the behaviour of each user or the search context can provide valuable information. One simple way is to treat each context as having its own bandit problem. Variations of the previous algorithms relate the context vector to the expected reward through linear models, neural networks, kernels, or random forests.
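
The simplest option mentioned above, one independent bandit per context, can be sketched as follows. This assumes a small set of discrete, hashable contexts and uses Thompson sampling with a Beta(1, 1) prior inside each context; the class name `PerContextThompson`, the context labels, and the click probabilities are all made up for the example.

```python
import random
from collections import defaultdict

class PerContextThompson:
    """Keeps a separate Bernoulli bandit for every distinct context."""

    def __init__(self, n_arms, rng):
        self.n_arms = n_arms
        self.rng = rng
        # Success and failure counts per context, created on first use.
        self.S = defaultdict(lambda: [0] * n_arms)
        self.F = defaultdict(lambda: [0] * n_arms)

    def choose(self, context):
        """Thompson sampling using only the statistics of this context."""
        S, F = self.S[context], self.F[context]
        samples = [self.rng.betavariate(s + 1, f + 1) for s, f in zip(S, F)]
        return max(range(self.n_arms), key=samples.__getitem__)

    def update(self, context, arm, reward):
        """Record a 0/1 reward for the given context and arm."""
        if reward:
            self.S[context][arm] += 1
        else:
            self.F[context][arm] += 1

rng = random.Random(0)
# Hypothetical click probabilities: the best ad differs by context.
click_prob = {"mobile": [0.8, 0.2], "desktop": [0.2, 0.8]}
policy = PerContextThompson(n_arms=2, rng=rng)
pulls = {c: [0, 0] for c in click_prob}
for _ in range(2000):
    context = rng.choice(list(click_prob))
    arm = policy.choose(context)
    reward = 1 if rng.random() < click_prob[context][arm] else 0
    policy.update(context, arm, reward)
    pulls[context][arm] += 1
print(pulls)
```

This approach shares nothing between contexts, so it stops being practical once contexts are numerous or continuous; that limitation is what motivates the model-based variants listed above.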

18 Thanks! Questions?

19 References. [1] E. Hazan. Introduction to Online Convex Optimization. [2] S. Raja. Multi-Armed Bandits and Exploration Strategies.
