Multi-task Linear Bandits

1 Multi-task Linear Bandits. Marta Soare (INRIA Lille - Nord Europe), Alessandro Lazaric (INRIA Lille - Nord Europe), Ouais Alsharif (McGill University & Google), Joelle Pineau (McGill University). NIPS 2014 Workshop on Transfer and Multi-task Learning: Theory meets Practice, Montreal, December 2014.

2-3 Motivating example: movie recommendation system. A user is described by features (age, profession, ...) and provides ratings; the goal is to learn their preferences as fast as possible. When a new, similar user arrives, we would like to learn their preferences even faster, i.e., with fewer ratings.

4-6 Sequential multi-task linear bandit.
Linear bandit task: a set of arms $\mathcal{X} \subset \mathbb{R}^d$ with $|\mathcal{X}| = K$ and $\|x\|_2 \le L$ for all $x \in \mathcal{X}$.
Reward model for task $j$: $r_j(x) = x^\top \theta_j + \eta$, where $\theta_j \in \mathbb{R}^d$ is unknown and $\eta$ is $R$-sub-Gaussian noise.
Cumulative regret w.r.t. the optimal arm $x^*_j = \arg\max_{x \in \mathcal{X}} x^\top \theta_j$:
$$R_{n_j} = \sum_{s=1}^{n_j} (x^*_j - x_s)^\top \theta_j.$$
Task similarity: there exists $\varepsilon > 0$ such that for any pair of tasks $(\theta, \theta')$ we have $\|\theta - \theta'\|_2 \le \varepsilon$.
Goal: improve performance on a particular task by leveraging information from previous tasks!
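
To make the setting concrete, here is a minimal sketch (not from the slides) of a single linear bandit task and its cumulative regret, using Gaussian noise as one particular $R$-sub-Gaussian choice; the class name LinearBanditTask and its methods are illustrative only.

```python
import numpy as np

class LinearBanditTask:
    """One task j: a finite set of arms in R^d, a hidden parameter theta_j, noisy linear rewards."""
    def __init__(self, arms, theta, noise_std=0.1, rng=None):
        self.arms = np.asarray(arms, dtype=float)    # shape (K, d)
        self.theta = np.asarray(theta, dtype=float)  # hidden theta_j in R^d
        self.noise_std = noise_std                   # Gaussian noise is R-sub-Gaussian with R = noise_std
        self.rng = rng if rng is not None else np.random.default_rng()

    def pull(self, arm_index):
        """Return a noisy reward r_j(x) = x^T theta_j + eta for the chosen arm."""
        x = self.arms[arm_index]
        return float(x @ self.theta + self.rng.normal(0.0, self.noise_std))

    def regret(self, pulled_indices):
        """Cumulative regret of a sequence of pulls w.r.t. the optimal arm x*_j."""
        values = self.arms @ self.theta              # expected reward of every arm
        best = values.max()
        return float(sum(best - values[i] for i in pulled_indices))
```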

7 Single-task estimates.
OLS estimate: for any task $j$, the estimate of $\theta_j$ after $n$ rewards is
$$\hat\theta_{j,n} = A_{j,n}^{-1} b_{j,n}, \qquad A_{j,n} = \sum_{t=1}^{n} x_t x_t^\top, \qquad b_{j,n} = \sum_{t=1}^{n} x_t r_t.$$
Prediction error (Thm. 2 in [Abbasi-Yadkori et al., NIPS 2012], w.p. $1-\delta$):
$$\big| x^\top \theta_j - x^\top \hat\theta^\lambda_{j,n} \big| \;\le\; \|x\|_{(A^\lambda_{j,n})^{-1}} \Big( R \sqrt{d \log\tfrac{1 + nL^2/\lambda}{\delta}} + \lambda^{1/2} \|\theta_j\| \Big) \;=:\; B_{j,n}(x).$$
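
A minimal sketch of the per-task regularized least-squares estimate and of a confidence width of the form $B_{j,n}(x)$ above; the helper names ridge_estimate and confidence_width are mine, and the constants $R$, $L$, $\lambda$, $\delta$, $S$ stand for the quantities appearing in the bound.

```python
import numpy as np

def ridge_estimate(X_pulled, rewards, lam=1.0):
    """Regularized least squares: A = lam*I + sum_t x_t x_t^T, b = sum_t x_t r_t, theta_hat = A^{-1} b."""
    X_pulled = np.asarray(X_pulled, dtype=float)   # shape (n, d)
    rewards = np.asarray(rewards, dtype=float)     # shape (n,)
    d = X_pulled.shape[1]
    A = lam * np.eye(d) + X_pulled.T @ X_pulled
    b = X_pulled.T @ rewards
    return np.linalg.solve(A, b), A

def confidence_width(x, A, n, R=0.1, L=1.0, lam=1.0, delta=0.05, S=1.0):
    """B(x) = ||x||_{A^{-1}} * ( R * sqrt(d * log((1 + n L^2 / lam) / delta)) + sqrt(lam) * S )."""
    x = np.asarray(x, dtype=float)
    d = A.shape[0]
    norm_x = np.sqrt(x @ np.linalg.solve(A, x))    # ||x||_{A^{-1}}
    beta = R * np.sqrt(d * np.log((1.0 + n * L**2 / lam) / delta)) + np.sqrt(lam) * S
    return norm_x * beta
```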

8-10 Multi-task estimates.
Pool the statistics of the $m$ previous tasks with those of the current task $m+1$:
$$\tilde\theta_{m+1,t} = \tilde A_{m+1,t}^{-1} \tilde b_{m+1,t}, \qquad \tilde A_{m+1,t} = \sum_{j=1}^{m} A_j + A_{m+1,t}, \qquad \tilde b_{m+1,t} = \sum_{j=1}^{m} b_j + b_{m+1,t}.$$
Since the pooled data come from tasks whose parameters may differ from $\theta_{m+1}$ by up to $\varepsilon$, the multi-task estimate is biased: in general $\mathbb{E}[\tilde\theta_{m+1,t}] \neq \theta_{m+1}$.

Theorem. Let $\tilde\theta^\lambda_{m+1,t}$ be the multi-task regularized least-squares estimate. Then, for any $\delta > 0$ and any $t \ge 1$, with probability greater than $1-\delta$,
$$x^\top \big( \tilde\theta^\lambda_{m+1,t} - \theta_{m+1} \big) \;\le\; \|x\|_{(\tilde A^\lambda_{m+1,t})^{-1}} \bigg( R \sqrt{d \log\frac{\det(\tilde A^\lambda_{m+1,t})^{1/2}}{\det(\lambda I)^{1/2}\,\delta}} + \lambda^{1/2} S \bigg) + \varepsilon \|x\| \;=:\; \tilde B_{m+1,t}(x).$$
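
A sketch of how the pooled estimate and the wider multi-task confidence bound could be computed from per-task statistics; the $\varepsilon \|x\|$ bias term follows the theorem above, and the function names and default constants are illustrative assumptions, not the authors' code.

```python
import numpy as np

def multitask_estimate(prev_stats, A_cur, b_cur, lam=1.0):
    """Pool the statistics of previous tasks with the current task.

    prev_stats: list of (A_j, b_j) pairs from the m previous tasks (without their lam*I part).
    Returns (theta_tilde, A_tilde) with A_tilde = lam*I + sum_j A_j + A_cur.
    """
    d = A_cur.shape[0]
    A_tilde = lam * np.eye(d) + A_cur
    b_tilde = b_cur.copy()
    for A_j, b_j in prev_stats:
        A_tilde = A_tilde + A_j
        b_tilde = b_tilde + b_j
    return np.linalg.solve(A_tilde, b_tilde), A_tilde

def multitask_width(x, A_tilde, R=0.1, lam=1.0, delta=0.05, S=1.0, eps=0.1):
    """B_tilde(x): self-normalized confidence term plus the eps * ||x|| bias due to task differences."""
    x = np.asarray(x, dtype=float)
    d = A_tilde.shape[0]
    norm_x = np.sqrt(x @ np.linalg.solve(A_tilde, x))          # ||x||_{A_tilde^{-1}}
    log_det_ratio = 0.5 * (np.linalg.slogdet(A_tilde)[1] - d * np.log(lam))
    beta = R * np.sqrt(d * (log_det_ratio + np.log(1.0 / delta))) + np.sqrt(lam) * S
    return norm_x * beta + eps * np.linalg.norm(x)
```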

11 MT-LinUCB.
Input: budgets $\{n_j\}_j$, arms $\mathcal{X} \subset \mathbb{R}^d$, regularizer $\lambda$
Initialize: $j = 1$, $A = \lambda I_d$, $b = \tilde b = 0_d$, $\tilde A_j = \lambda I_d$, $\hat\theta_j = A^{-1} b$
for $j = 1, \dots, m+1$ do
    Compute the multi-task OLS solution: $\tilde A_j = \tilde A_j + A - \lambda I_d$, $\tilde b = \tilde b + b$, $\tilde\theta_j = \tilde A_j^{-1} \tilde b$
    Compute the task-specific OLS solution: $A = \lambda I_d$, $b = 0_d$, $\hat\theta_j = A^{-1} b$
    for $t = 1, \dots, n_j$ do
        Select the arm according to the tightest bound: $x_t = \arg\max_{x \in \mathcal{X}} \min\big[ x^\top \hat\theta_j + B_{j,t}(x);\; x^\top \tilde\theta_j + \tilde B_{j,t}(x) \big]$
        Observe the reward: $r_t = x_t^\top \theta_j + \eta_t$
        Update: $A$, $b$, $\hat\theta_j$, $B_{j,t}$, $\tilde A_j$, $\tilde b$, $\tilde\theta_j$, $\tilde B_{j,t}$
    end for
end for
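
Putting the pieces together, the selection rule of MT-LinUCB can be sketched as below, reusing the illustrative helpers and the LinearBanditTask class from the earlier snippets; this is an informal rendering of the pseudocode under those assumptions, not the authors' implementation.

```python
import numpy as np

def mt_linucb_task(task, n_rounds, A_tilde, b_tilde, lam=1.0, R=0.1, L=1.0,
                   delta=0.05, S=1.0, eps=0.1):
    """Run MT-LinUCB on one task with budget n_rounds, starting from pooled statistics (A_tilde, b_tilde).

    Returns the pulled arm indices and the updated pooled statistics.
    """
    d = task.arms.shape[1]
    A, b = lam * np.eye(d), np.zeros(d)                  # task-specific statistics
    pulled = []
    for t in range(1, n_rounds + 1):
        theta_hat = np.linalg.solve(A, b)                # single-task estimate
        theta_tilde = np.linalg.solve(A_tilde, b_tilde)  # multi-task estimate
        scores = []
        for x in task.arms:
            ucb_single = x @ theta_hat + confidence_width(x, A, t, R, L, lam, delta, S)
            ucb_multi = x @ theta_tilde + multitask_width(x, A_tilde, R, lam, delta, S, eps)
            scores.append(min(ucb_single, ucb_multi))    # keep the tightest upper bound
        i = int(np.argmax(scores))
        r = task.pull(i)
        x = task.arms[i]
        A += np.outer(x, x); b += r * x                  # update task-specific statistics
        A_tilde += np.outer(x, x); b_tilde += r * x      # update pooled statistics
        pulled.append(i)
    return pulled, A_tilde, b_tilde
```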

12 Experiments. 200 tasks with $\theta_1, \dots, \theta_{200} \in \mathbb{R}^2$ randomly generated, $\max \|\theta - \theta'\|_2 =$ [...], [...] samples for each task, $\varepsilon =$ [...]. MT-LinUCB is never worse than LinUCB. [Figure: regret for each task, LinUCB vs. MT-LinUCB, as a function of the task number.]

13 Experiments. As the difference between tasks shrinks, the advantage of MT-LinUCB becomes more evident. For $\varepsilon = 0.2$, the regret of MT-LinUCB decreases with every additional task, while the regret of LinUCB remains essentially constant. [Figure: regret for each task, LinUCB vs. MT-LinUCB, as a function of the task number.]
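
One way to reproduce this kind of experiment with the snippets above: generate task parameters inside an $\varepsilon$-ball, run MT-LinUCB sequentially while carrying the pooled statistics across tasks, and record the per-task regret. The number of tasks, dimension, budget, and $\varepsilon$ below are placeholder values, not the ones used on the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, n_tasks, budget, eps = 2, 20, 50, 200, 0.2

arms = rng.normal(size=(K, d))
arms /= np.linalg.norm(arms, axis=1, keepdims=True)       # arms on the unit sphere, so L = 1

center = rng.normal(size=d)
thetas = []
for _ in range(n_tasks):
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    radius = (eps / 2) * rng.uniform() ** (1.0 / d)        # uniform in a ball of radius eps/2
    thetas.append(center + radius * u)                     # pairwise distances are at most eps

A_tilde, b_tilde = np.eye(d), np.zeros(d)                  # pooled statistics (lam = 1)
regrets = []
for theta in thetas:
    task = LinearBanditTask(arms, theta, noise_std=0.1, rng=rng)
    pulled, A_tilde, b_tilde = mt_linucb_task(task, budget, A_tilde, b_tilde, eps=eps)
    regrets.append(task.regret(pulled))
print(regrets)   # per-task cumulative regret; transfer should help as more tasks are seen
```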

14 Follow-up work. Regret bound for MT-LinUCB. Additional experiments on datasets (see the Boston Housing example on the poster). Weighting scheme according to task relevance.

15 Thank you! SequeL team, INRIA Lille.
