Yevgeny Seldin. University of Copenhagen

1 Yevgeny Seldin University of Copenhagen

2 Classical (Batch) Machine Learning
Pipeline: Collect Data → Machine Learning → Prediction rule; New situation + Prediction rule → Decision/Action.
Data assumption: the samples are independent and identically distributed (i.i.d.).

3 How is Online different from Batch?
Batch learning: Collect Data → Analyze → Act.
Online learning: Act → Analyze → Get More Data → Act → ...

4 Examples
- Investment in the stock market
- Online advertising / personalization
- Online routing
- Games
- Robotics

5 When do we need Online Learning? Until recently, statistical theory has been restricted to the design and analysis of sampling experiments in which the size and composition of the samples are completely determined before the experimentation begins. The reasons for this are partly historical, dating back to the time when the statistician was consulted, if at all, only after the experiment was over, and partly intrinsic in the mathematical difficulty of working with anything but a fixed number of independent random variables. A major advance now appears to be in the making with the creation of a theory of the sequential design of experiments, in which the size and composition of the samples are not fixed in advance but are functions of the observations themselves. (Robbins, 1952)

6 When do we need Online Learning?
- Interactive learning
- Adversarial, game-theoretic settings
- No i.i.d. assumption
- Intelligent data collection / experiment design
- Large-scale data analysis

7

8 The Space of Online Learning Problems
[diagram with the label "limited (bandit)"]

9 The Space of Online Learning Problems
[diagram with the label "limited (bandit)" and the examples: stock market, medical treatments, advertising]

10 The Space of Online Learning Problems
[same diagram as slide 9, annotated "Exploration-Exploitation Trade-off"]

11 Exploration-Exploitation Trade-off
Which drug to give to a new patient? (Candidate drugs with history: never tried, 0/2 successes, 6/10 successes.)
When there are more patients to come, the loop Act → Analyze → More Data means we are building the dataset for ourselves.
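
As an illustration of how such a trade-off can be handled, here is a minimal sketch of UCB1, a classical stochastic-bandit index policy that balances trying the untested drug against exploiting the 6/10 one. It is a standard textbook method rather than anything specific to this talk, and the success counts, true success rates and horizon below are hypothetical.

```python
import math
import random

def ucb1(successes, pulls, t):
    """Pick the arm with the highest upper confidence bound.

    Arms that were never tried are explored first.
    """
    best_arm, best_index = None, -float("inf")
    for arm, n in enumerate(pulls):
        if n == 0:
            return arm  # untried arm: explore it
        index = successes[arm] / n + math.sqrt(2 * math.log(t) / n)
        if index > best_index:
            best_arm, best_index = arm, index
    return best_arm

# Hypothetical scenario mirroring the slide: three drugs with histories
# "never tried", 0/2 and 6/10, and roughly a thousand more patients to come.
random.seed(0)
true_rates = [0.7, 0.1, 0.6]          # unknown to the learner
successes, pulls = [0, 0, 6], [0, 2, 10]
for t in range(sum(pulls) + 1, 1001):
    arm = ucb1(successes, pulls, t)
    reward = 1 if random.random() < true_rates[arm] else 0
    successes[arm] += reward
    pulls[arm] += 1
print("pulls per drug:", pulls)
```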

12 The Space of Online Learning Problems
[diagram: vertical axis "environmental resistance", running from i.i.d. to adversarial; "bandit" marked on the other axis; examples placed along it: spam filtering, stock market, medical treatments, weather prediction]

13 The Space of Online Learning Problems
[same diagram as slide 12]

14 Learning in Adversarial Environments
- Game-theoretic setting; cannot be treated in batch learning (Collect Data → Analyze → Act), only online (Act → Analyze → More Data → ...).
- Evaluation measure: regret, the difference in performance compared to the best choice in hindsight (out of a limited set), e.g. investment revenue vs. the best stock in hindsight.
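
To make the evaluation measure concrete, the small sketch below computes the regret of a sequence of choices against the best single arm in hindsight; the gains matrix and the sequence of choices are hypothetical illustrations.

```python
import numpy as np

def regret(gains, choices):
    """Regret of a sequence of choices against the best single arm in hindsight.

    gains:   shape (T, K), gains[t, k] = gain of choice k at round t
    choices: length-T array of the indices actually chosen
    """
    earned = gains[np.arange(len(choices)), choices].sum()
    best_in_hindsight = gains.sum(axis=0).max()
    return best_in_hindsight - earned

# Hypothetical example: daily returns of 3 stocks over 5 days,
# with the investor switching between stocks.
rng = np.random.default_rng(0)
gains = rng.uniform(-1.0, 1.0, size=(5, 3))
choices = np.array([0, 2, 1, 0, 2])
print("regret:", regret(gains, choices))
```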

15 The Space of Online Learning Problems
[diagram: vertical axis "environmental resistance" (i.i.d. → adversarial); horizontal axis "structural complexity" (stateless bandit → contextual → Markov Decision Processes (MDP)); examples: medical records of different patients (contextual), subsequent treatments of the same patient (MDP)]

16 The Space of Online Learning Problems
[same diagram as slide 15, annotated "Delayed Feedback"]

17 Delayed Feedback
"Would you like to have a drink? Yes / No" — "Yes!!!"
The same question the next morning — "No."
(The feedback on the decision arrives only later.)

18 The Space of Online Learning Problems
[same diagram; the MDP region with delayed feedback is labeled "Reinforcement Learning"]

19 The Space of Online Learning Problems
[same diagram, annotated: "Classical batch ML would be here"]

20 Simplicities along the axes
[the two-axis diagram: environmental resistance (i.i.d. → adversarial) vs. structural complexity (stateless bandit → contextual → MDPs)]

21 Simplicities along the axes
[same diagram, annotated: "The more the simpler"]

22 Simplicities along the axes
- Gaps (between action outcomes)
- Variance / range (of action outcomes)
[same diagram]

23 Simplicities along the axes
- Reducibility of the state space (context relevance)
- Mixing time (in MDPs)
[same diagram]

24 Some standard algorithms
[diagram placing algorithm families in the space: Prediction with expert advice (adversarial, stateless); Adversarial bandits (adversarial, bandit); Stochastic bandits (i.i.d., bandit)]
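
As a concrete instance of the first of these families, here is a minimal sketch of exponential weights (Hedge) for prediction with expert advice, the algorithm behind the $\sqrt{T \ln K}$ rate on the next slide. It is a standard construction rather than anything specific to this talk; the random losses and the fixed-horizon learning rate $\eta \approx \sqrt{\ln K / T}$ are illustrative assumptions.

```python
import numpy as np

def hedge(loss_matrix, eta):
    """Exponential weights (Hedge) for prediction with expert advice.

    loss_matrix: shape (T, K), full-information losses in [0, 1].
    Returns the sequence of probability vectors played.
    """
    T, K = loss_matrix.shape
    cum_loss = np.zeros(K)
    plays = []
    for t in range(T):
        w = np.exp(-eta * (cum_loss - cum_loss.min()))  # shift for numerical stability
        plays.append(w / w.sum())
        cum_loss += loss_matrix[t]  # full information: all K losses are observed
    return np.array(plays)

# Illustrative run: K = 5 experts, T = 1000 rounds, random losses.
rng = np.random.default_rng(1)
T, K = 1000, 5
losses = rng.uniform(size=(T, K))
weights = hedge(losses, eta=np.sqrt(np.log(K) / T))
print("final weight vector:", np.round(weights[-1], 3))
```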

25 Typical performance scaling
[the same diagram with regret rates in place of the algorithms: $\sqrt{T \ln K}$ for prediction with expert advice; $\sqrt{KT}$ for adversarial bandits; $\sum_a \frac{\ln T}{\Delta(a)}$ for stochastic bandits]
$K$ — the number of actions; $T$ — the number of rounds; $\Delta(a)$ — the suboptimality gap of action $a$.

26 Standard algorithms
[the same diagram with the rates from slide 25]
The standard algorithms assume a certain form of simplicity and exploit it.

27 [diagram: the space of online learning problems]

28 Prediction with limited advice / Bandits with paid observations [Seldin, Bartlett, Crammer, Abbasi-Yadkori, ICML, 2014]
[diagram]
Examples: multiple algorithms and/or parametrizations but limited computational resources, so only a subset can be executed; expensive resources.

29 Environmental resistance in full information [Koolen & van Erven, COLT, 2015; Luo & Schapire, COLT, 2015; Wintenberger, 2015; van Erven, Kotłowski & Warmuth, COLT, 2014; Gaillard, Stoltz & van Erven, COLT, 2014; Cesa-Bianchi, Mansour & Stoltz, MLJ, 2007; ...]
[diagram]
Examples: small deviations from i.i.d. rather than full adversariality; adaptation to i.i.d.; adaptation to the effective outcome range.

30 Environmental resistance in bandits [Bubeck & Slivkins, COLT, 2012; Seldin & Slivkins, ICML, 2014; Auer & Chiang, COLT, 2016; Seldin & Lugosi, COLT, 2017; Wei & Luo, COLT, 2018; Zimmert & Seldin, AISTATS, 2019]
[diagram]
Examples: contaminated observations; adaptation to i.i.d. ✗ Adaptation to the effective outcome range is impossible! [Gerchinovitz & Lattimore, NIPS, 2016]

31 Prediction with Limited Advice [Thune & Seldin, NIPS, 2018]
[diagram]
With at least 2 observations per round: adaptation to i.i.d.; adaptation to the effective outcome range.

32 Prediction with Limited Advice: Problem Setting
$K$ actions; at each round $t$ the environment generates losses $\ell_t^1, \dots, \ell_t^K$ (a $K \times T$ loss table unfolding over time).
Adversarial: $\ell_t^i$ arbitrary in $[0,1]$.
I.I.D.: $\mathbb{E}[\ell_t^i] = \mu_i$; gaps $\Delta_i = \mu_i - \min_j \mu_j$.
Effective loss range $\varepsilon$: $\forall i, j, t:\ |\ell_t^i - \ell_t^j| \le \varepsilon$.
Evaluation measure: pseudo-regret $R_T = \mathbb{E}\!\left[\sum_{t=1}^{T} \ell_t^{A_t}\right] - \min_{i \in \{1,\dots,K\}} \mathbb{E}\!\left[\sum_{t=1}^{T} \ell_t^{i}\right]$.

33 SODA: Second Order Difference Adjustment [Thune & Seldin, NIPS, 2018]
The primary action $A_t$ is sampled according to $p_t^i \propto e^{\eta_t D_{t-1}^i - \eta_t^2 S_{t-1}^i}$.
The secondary observation $B_t$ is sampled uniformly from $\{1, \dots, K\} \setminus \{A_t\}$.
Unbiased loss-difference estimates: $\widehat{\Delta\ell}_t^{\,i} = (K-1)\,\mathbb{I}\{B_t = i\}\,(\ell_t^{A_t} - \ell_t^{B_t})$.
Loss-difference estimator and its second moment: $D_t^i = \sum_{s=1}^{t} \widehat{\Delta\ell}_s^{\,i}$, $\quad S_t^i = \sum_{s=1}^{t} \big(\widehat{\Delta\ell}_s^{\,i}\big)^2$.
The learning rate: $\eta_t = \sqrt{\dfrac{\ln K}{\max_i S_{t-1}^i}}$.
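
Read literally off the update rules above, the algorithm fits in a few lines of code. The sketch below follows the slide's formulas but is not the authors' reference implementation: the Bernoulli loss generator, the floor of 1 inside the learning rate (to avoid division by zero in the first round) and the numerical-stability shift are assumptions added for the illustration.

```python
import numpy as np

def soda(losses, rng):
    """Sketch of SODA following the update rules on the slide above.

    losses: shape (T, K) array with entries in [0, 1] (generated up front here
    for simplicity; adversarial losses may in general depend on the past).
    """
    T, K = losses.shape
    D = np.zeros(K)   # cumulative loss-difference estimates D_t
    S = np.zeros(K)   # their cumulative squares S_t
    total_loss = 0.0
    for t in range(T):
        eta = np.sqrt(np.log(K) / max(S.max(), 1.0))   # floor avoids division by zero
        score = eta * D - eta ** 2 * S
        w = np.exp(score - score.max())                # shift for numerical stability
        p = w / w.sum()
        A = rng.choice(K, p=p)                                 # primary action
        B = rng.choice([i for i in range(K) if i != A])        # secondary observation
        diff = np.zeros(K)
        diff[B] = (K - 1) * (losses[t, A] - losses[t, B])      # unbiased difference estimate
        D += diff
        S += diff ** 2
        total_loss += losses[t, A]
    return total_loss

# Hypothetical i.i.d. instance: 4 arms with Bernoulli losses of different means.
rng = np.random.default_rng(0)
T, K = 5000, 4
means = np.array([0.5, 0.45, 0.6, 0.4])
losses = (rng.uniform(size=(T, K)) < means).astype(float)
print("SODA cumulative loss:", soda(losses, rng))
print("best single arm loss:", losses.sum(axis=0).min())
```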

34 SODA: Regret Guarantees [Thune & Seldin, NIPS, 2018]
Adversarial: $R_T = O\!\left(\varepsilon \sqrt{KT \ln K}\right)$.
Stochastic: $R_T = O\!\left(K \varepsilon^2 \ln K \sum_{i: \Delta_i > 0} \frac{1}{\Delta_i}\right)$.
Knowledge of i.i.d./adversarial is not required! Knowledge of $\varepsilon$ is not required! Simultaneous adaptation to two types of easiness!

35 An optimal algorithm for i.i.d. and adversarial bandits [Zimmert & Seldin, AISTATS, 2019]
[diagram]

36 I.I.D. and Adversarial Bandits: Problem Setting
$K$ actions; losses $\ell_t^i$ unfold over time as before.
Evaluation measure: pseudo-regret (as before), $R_T = \mathbb{E}\!\left[\sum_{t=1}^{T} \ell_t^{A_t}\right] - \min_{i \in \{1,\dots,K\}} \mathbb{E}\!\left[\sum_{t=1}^{T} \ell_t^{i}\right]$.
Adversarial: as before.
Stochastically constrained adversary [Wei & Luo, 2018]: for the best action $i^*$, $\mathbb{E}[\ell_t^i - \ell_t^{i^*}] = \Delta_i \ge 0$.
I.I.D.: the special case when $\mathbb{E}[\ell_t^i] = \mu_i$.

37 α-Tsallis-INF [Zimmert & Seldin, AISTATS, 2019]
Tsallis entropy: $H_\alpha(x) = \frac{1}{1-\alpha}\left(1 - \sum_i x_i^\alpha\right)$.
INF: Implicitly Normalized Forecaster [Audibert & Bubeck, 2009].

38 α-Tsallis-INF [Zimmert & Seldin, AISTATS, 2019]
$w_t = \arg\min_{w \in \Delta_K} \langle w, \widehat{L}_{t-1} \rangle - \sum_i \frac{w_i^\alpha}{\alpha \eta_t}$
Sample $A_t \sim w_t$.
Update $\widehat{L}_t^i = \widehat{L}_{t-1}^i + \frac{\ell_t^i \,\mathbb{I}(A_t = i)}{w_t^i}$.
$\alpha = 1$ corresponds to entropic regularization (the EXP3 algorithm); $\alpha = 0$ corresponds to the log-barrier.
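
For $\alpha = \tfrac12$ the arg-min above has an explicit form: the first-order conditions give $w_t^i = \big(\eta_t(\widehat{L}_{t-1}^i + \nu)\big)^{-2}$ with the normalizing constant $\nu$ chosen so that the weights sum to one. The sketch below is one way to implement this (not the authors' reference code): $\nu$ is found by bisection rather than a Newton step, the learning rate $\eta_t = 1/\sqrt{t}$ is taken from the next slide, and the Bernoulli loss instance is a hypothetical illustration.

```python
import numpy as np

def tsallis_half_weights(L_hat, eta, iters=60):
    """Minimize <w, L_hat> - (1/(alpha*eta)) * sum(w_i^alpha) over the simplex
    for alpha = 1/2.  The minimizer is w_i = (eta * (L_hat_i + nu))**(-2) with
    nu found here by bisection so that the weights sum to one."""
    lo = -L_hat.min() + 1e-9               # smallest nu keeping every term positive
    hi = lo + 1.0
    while ((eta * (L_hat + hi)) ** -2).sum() > 1.0:
        hi *= 2.0                          # weight sum is decreasing in nu
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ((eta * (L_hat + mid)) ** -2).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    w = (eta * (L_hat + hi)) ** -2
    return w / w.sum()                     # tiny renormalization for safety

def tsallis_inf(losses, rng):
    """Sketch of 1/2-Tsallis-INF with importance-weighted loss estimates."""
    T, K = losses.shape
    L_hat = np.zeros(K)
    total_loss = 0.0
    for t in range(T):
        eta = 1.0 / np.sqrt(t + 1)                  # learning rate eta_t = 1/sqrt(t)
        w = tsallis_half_weights(L_hat, eta)
        A = rng.choice(K, p=w)
        total_loss += losses[t, A]
        L_hat[A] += losses[t, A] / w[A]             # importance-weighted estimate
    return total_loss

# Hypothetical i.i.d. instance: 4 arms with Bernoulli losses of different means.
rng = np.random.default_rng(0)
T, K = 5000, 4
means = np.array([0.5, 0.45, 0.6, 0.4])
losses = (rng.uniform(size=(T, K)) < means).astype(float)
print("Tsallis-INF cumulative loss:", tsallis_inf(losses, rng))
print("best single arm loss:", losses.sum(axis=0).min())
```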

39 ½-Tsallis-INF: Regret Guarantees [Zimmert & Seldin, AISTATS, 2019]
Learning rate: $\eta_t = 1/\sqrt{t}$.
Adversarial: $R_T \le 4\sqrt{KT} + 1$.
Stochastically constrained adversary (i.i.d. is a special case): $R_T \le \sum_{i \ne i^*} \frac{4 \ln T + 20}{\Delta_i} + 4\sqrt{K}$.
Both results match the lower bounds (up to constants)! No knowledge of i.i.d./adversarial is required!

40 Tsallis-INF in relation to prior work [Zimmert & Seldin, AISTATS, 2019]
Overhead of the upper bound relative to the lower bound, per regime:
- BROAD [Wei & Luo, 2018], log-barrier + doubling ($\alpha = 0$): i.i.d. $O(K^2)$; adversarial $O(\ln T)$.
- $\alpha = \tfrac12$ (Tsallis-INF): i.i.d. & adversarial $O(1)$.
- EXP3++ [Seldin & Lugosi, 2017], entropic regularization + mixed-in extra exploration ($\alpha = 1$): i.i.d. $O(\ln T)$; adversarial $O(\ln K)$.

41 ½-Tsallis-INF: Experiments [Zimmert & Seldin, AISTATS, 2019]
[regret plots for two regimes: i.i.d. and a stochastically constrained adversary]

42 ½-Tsallis-INF: Additional Results [Zimmert & Seldin, AISTATS, 2019]
- Optimality in the moderately contaminated stochastic regime
- I.I.D. and adversarial optimality in utility-based dueling bandits

43 [diagram recap: the space of online learning problems]

44 [diagram recap: the space of online learning problems]

45 [diagram recap: the space of online learning problems]

46 Check my homepage (or Google me)
- Papers, tutorials, lecture notes, etc.
- Join our Advanced Topics in Machine Learning course, co-taught by Christian Igel, September-October 2019. WARNING: math-heavy course.

47 α-Tsallis-INF: Some Considerations [Zimmert & Seldin, AISTATS, 2019]
Exploration rate of a suboptimal arm $i$: $\big(\eta_t (\widehat{L}_t^i - \widehat{L}_t^{i^*})\big)^{-\frac{1}{1-\alpha}}$.
In the i.i.d. regime $\mathbb{E}[\widehat{L}_t^i - \widehat{L}_t^{i^*}] = \Delta_i t$, and for regret of order $\frac{\ln T}{\Delta_i}$ the exploration rate must be of order $\frac{1}{\Delta_i^2 t}$.
Setting $\big(\eta_t \Delta_i t\big)^{-\frac{1}{1-\alpha}} = \frac{1}{\Delta_i^2 t}$ gives $\eta_t^i = \Delta_i^{1-2\alpha}\, t^{-\alpha}$.
$\alpha = \tfrac12$ is the only value for which $\eta_t^i$ requires no tuning by the (unknown) $\Delta_i$.
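
Spelling out the algebra behind the last step (a direct rearrangement of the condition above, added for completeness):
$$\big(\eta_t\,\Delta_i t\big)^{-\frac{1}{1-\alpha}} = \frac{1}{\Delta_i^2 t}
\;\Longleftrightarrow\;
\eta_t\,\Delta_i t = \big(\Delta_i^2 t\big)^{1-\alpha}
\;\Longleftrightarrow\;
\eta_t = \Delta_i^{1-2\alpha}\, t^{-\alpha},$$
and at $\alpha = \tfrac12$ the dependence on the unknown gap vanishes, leaving $\eta_t = 1/\sqrt{t}$.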

48 α-Tsallis-INF: Some Considerations [Zimmert & Seldin, AISTATS, 2019]
$\mathbb{E}[\widehat{L}_t^i - \widehat{L}_t^{i^*}] = \Delta_i t$, but the variance of $\widehat{L}_t^i - \widehat{L}_t^{i^*}$ is of order $\Delta_i^2 t^2$, so $\widehat{L}_t^i - \widehat{L}_t^{i^*}$ cannot be efficiently controlled by Bernstein's inequality.
Our analysis is based instead on a self-bounding property of the regret.
