Graphs in Machine Learning

Size: px

Start display at page:

Download "Graphs in Machine Learning"

Jonathan Price
6 years ago
Views:

1 Graphs in Machine Learning Michal Valko Inria Lille - Nord Europe, France Partially based on material by: Toma s Koca k November 23, 2015 MVA 2015/2016

2 Last Lecture Examples of applications of online SSL Analysis of online SSL SSL Learnability When does graph-based SSL provably help? Scaling harmonic functions to millions of samples Michal Valko Graphs in Machine Learning SequeL - 2/40

3 Previous Lab Session by Daniele Calandriello Content Semi-supervised learning Graph quantization Online face recognizer Short written report Questions to piazza Deadline: ~calandri/teaching.html Michal Valko Graphs in Machine Learning SequeL - 3/40

4 This Lecture Online decision-making on graphs Graph bandits Smoothness of rewards (preferences) on a given graph Observability graphs Exploiting side information Michal Valko Graphs in Machine Learning SequeL - 4/40

5 Final Class projects detailed description on the class website preferred option: you come up with the topic theory/implementation/review or a combination one or two people per project (exceptionally three) grade 60%: report + short presentation of the team deadlines strongly recommended DL for taking projects hard DL for taking projects submission of the project report (or later) - project presentation list of suggested topics on piazza Michal Valko Graphs in Machine Learning SequeL - 5/40

6 Online Decision Making on Graphs Michal Valko Graphs in Machine Learning SequeL - 6/40

7 Online Decision Making on Graphs: Smoothness Sequential decision making in structured settings we are asked to pick a node (or a few nodes) in a graph the graph encodes some structural property of the setting goal: maximize the sum of the outcomes application: recommender systems First application: Exploiting smoothness fixed graph iid outcomes neighboring nodes have similar outcomes Michal Valko Graphs in Machine Learning SequeL - 7/40

8 Online Decision Making on Graphs Movie recommendation: (in each time step) Recommend movies to a single user. Good prediction after a few steps (T N). Goal: Maximize overall reward (sum of ratings). Assumptions: Unknown reward function f : V (G) R. Function f is smooth on a graph. Neighboring movies similar preferences. Similar preferences neighboring movies. The Shawshank Redemption (1994) The Godfather (1972) The Godfather: Part II (1974) The Dark Knight (2008) Pulp Fiction (1994) The Good, the Bad and the Ugly (1966) Schindler s List (1993) 12 Angry Men (1957) The Lord of the Rings: The Return of the King (2003) Fight Club (1999) The Lord of the Rings: The Fellowship of the Ring (2001) Star Wars: Episode V - The Empire Strikes Back (1980) Inception (2010) Forrest Gump (1994) One Flew Over the Cuckoo s Nest (1975) The Lord of the Rings: The Two Towers (2002) Goodfellas (1990) Star Wars: Episode IV - A New Hope (1977) The Matrix (1999) Seven Samurai (1954) City of God (2002) Se7en (1995) The Usual Suspects (1995) The Silence of the Lambs (1991) Once Upon a Time in the West (1968) It s a Wonderful Life (1946) Léon: The Professional (1994) Casablanca (1942) Life Is Beautiful (1997) Raiders of the Lost Ark (1981) American History X (1998) Psycho (1960) Rear Window (1954) City Lights (1931) Saving Private Ryan (1998) Spirited Away (2001) The Intouchables (2011) Memento (2000) Terminator 2: Judgment Day (1991) Modern Times (1936) 1 0 Michal Valko Graphs in Machine Learning SequeL - 8/40

9 Let s be lazy: Ignore the structure! This is an multi-armed bandit problem! The performance depends on the number of movies (N arms). NT Worst case regret (to the best fixed strategy) RT = O What is N for imdb.com? 3,538,545 Michal Valko Graphs in Machine Learning SequeL - 9/40

10 Let s be lazy: Ignore the structure! Another problem of the typical bandits strategies for recommendation? If there is no information shared, we need to try all of the options! UCB/MOSS and likely TS start with pulling each of the arms once This is a problem both algorithmically and theoretically.... Watch all the movies and then I tell you which one you like.... What do we need for movie recommendation? An algorithm useful in the case T N! Exploiting the structure is a must! Michal Valko Graphs in Machine Learning SequeL - 10/40

11 Recap: Smooth graph functions f = (f 1,..., f N ) T : Vector of function values. Let L = QΛQ T be the eigendecomposition of the Laplacian. Diagonal matrix Λ whose diagonal entries are eigenvalues of L. Columns of Q are eigenvectors of L. Columns of Q form a basis. α : Unique vector such that Qα = f Note: Q T f = α S G (f) = f T Lf = f T QΛQ T f = α T Λα = α 2 Λ = N λ i (αi ) 2 i=1 Smoothness and regularization: Small value of (a) S G (f) (b) Λ norm of α (c) α i for large λ i Michal Valko Graphs in Machine Learning SequeL - 11/40

12 Smooth graph functions: Flixster eigenvectors 0.2 Eigenvector Eigenvector Eigenvector Eigenvector Eigenvector Eigenvector Eigenvector 7 Eigenvector 8 Eigenvector 9 Eigenvectors 0.2 from the Flixster 0.2 data corresponding 0.2 to the smallest few eigenvalues of the graph Laplacian projected onto the first principal 0 component 0of data. Colors indicate 0 the values Michal 0 Valko 1 Graphs 1 in Machine0 Learning SequeL 1-12/40

13 Online Learning Setting - Bandit Problem Learning setting for a bandit algorithm π In each time t step choose a node π(t). the π(t)-th row x π(t) of the matrix Q corresponds to the arm π(t). Obtain noisy reward r t = xπ(t) T α + ε t. Note: xπ(t)α T = f π(t) ε t is R-sub-Gaussian noise. ξ R, E[e ξεt ] exp ( ξ 2 R 2 /2 ) Minimize cumulative regret R T = T max (xaα T ) a T xπ(t) T α. t=1 What is a good result? Can t we just use linear bandits? Michal Valko Graphs in Machine Learning SequeL - 13/40

14 Online Decision Making on Graphs: Smoothness Linear bandit algorithms LinUCB (Li et al., 2010) Regret bound D T ln T LinearTS (Agrawal and Goyal, 2013) Regret bound D T ln N Note: D is ambient dimension, in our case N, length of x i. Number of actions, e.g., all possible movies HUGE! Spectral bandit algorithms SpectralUCB (Valko et al., ICML 2014) Regret bound d T ln T Operations per step: D 2 N SpectralTS (Kocák et al., AAAI 2014) Regret bound d T ln N Operations per step: D 2 + DN Note: d is effective dimension, usually much smaller than D. Michal Valko Graphs in Machine Learning SequeL - 14/40

15 Effective dimension Effective dimension: Largest d such that (d 1)λ d T log(1 + T /λ). Function of time horizon and graph properties λ i : i-th smallest eigenvalue of Λ. λ: Regularization parameter of the algorithm. Properties: d is small when the coefficients λ i grow rapidly above time. d is related to the number of non-negligible dimensions. Usually d is much smaller than D in real world graphs. Can be computed beforehand. Michal Valko Graphs in Machine Learning SequeL - 15/40

16 Effective dimension vs. Ambient dimension 7 Barabasi Albert graph N= Flixster graph: N= effective dimenstion effective dimenstion time T d D time T Note: In our setting T < N = D. Michal Valko Graphs in Machine Learning SequeL - 16/40

17 UCB-style algorithms: Estimate Expected reward Michal Valko Graphs in Machine Learning SequeL - 17/40

18 UCB-style algorithms: Sample Expected reward Michal Valko Graphs in Machine Learning SequeL - 17/40

19 UCB-style algorithms: Estimate... Expected reward Michal Valko Graphs in Machine Learning SequeL - 17/40

20 SpectralUCB Given a vector of weights α, we define its Λ norm as α Λ = N λ k αk 2 = α T Λα, k=1 and fit the ratings r v with a (regularized) least-squares estimate ( t ) α t = arg min α [ x v, α r v ] 2 + α 2 Λ. v=1 α Λ is a penalty for non-smooth combinations of eigenvectors. Michal Valko Graphs in Machine Learning SequeL - 18/40

21 SpectralUCB 1: Input: 2: N, T, {Λ L, Q}, λ, δ, R, C 3: Run: 4: Λ Λ L + λi 5: d max{d : (d 1)λ d T / ln(1 + T /λ)} 6: for t = 1 to T do 7: Update the basis coefficients α: 8: X t [x π(1),..., x π(t 1) ] T 9: r [r 1,..., r t 1] T 10: V t X tx T t + Λ 11: 12: α t V 1 t X T t r c t 2R d ln(1 + t/λ) + 2 ln(1/δ) + C 13: π(t) arg max a (xa T α + c t x a V 1 t 14: Observe the reward r t 15: end for ) Michal Valko Graphs in Machine Learning SequeL - 19/40

22 SpectralUCB: Synthetic experiment cumulative regret Barabasi Albert N=250, basis size=3, effective d=1 SpectralEliminator SpectralUCB LinUCB time T Michal Valko Graphs in Machine Learning SequeL - 20/40

23 SpectralUCB: Movie data experiments cumulative regret Movielens: Cumulative regret for randomly sampled users. T = 100 SpectralUCB LinUCB cumulative regret Flixster: Cumulative regret for randomly sampled users. T = 100 SpectralUCB LinUCB Michal Valko Graphs in Machine Learning SequeL - 21/40

24 SpectralUCB: Regret Bound d: Effective dimension. λ: Minimal eigenvalue of Λ = Λ L + λi. C: Smoothness upper bound, α Λ C. x T i α [ 1, 1] for all i. The cumulative regret R T of SpectralUCB is with probability 1 δ bounded as R T ( 8R d ln λ + T λ + 2 ln 1 δ + 4C + 4 ) dt ln λ + T λ. R T d T ln T Michal Valko Graphs in Machine Learning SequeL - 22/40

25 SpectralUCB: Regret Bound Derivation of the confidence ellipsoid for α with probability 1 δ. Using analysis of OFUL (Abbasi-Yadkori et al., 2011) x T ( α α ) x V 1 t ( ( ) ) R 2 ln Vt 1/2 + C δ Λ 1/2 Regret in one time step: r t = x α T xπ(t) T α 2c t x π(t) V 1 t Cumulative regret: R T = T r t T T rt 2 2( c T + 1) t=1 t=1 Upperbound for ln( V t / Λ ) ln V t Λ ln V T Λ 2T ln V T Λ ( ) λ + T 2d ln λ Michal Valko Graphs in Machine Learning SequeL - 23/40

26 SpectralUCB: Regret Bound Sylvester s determinant theorem: A + xx T = A I + A 1 xx T = A (1 + x T A 1 x) Goal: Upperbound determinant A + xx T for x 2 1 Upperbound x T A 1 x x T A 1 x = x T QΛ 1 Q T x = y T Λ 1 y = N i=1 λ 1 i yi 2 y 2 1. y is a canonical vector. x = Qy is an eigenvector of A. Michal Valko Graphs in Machine Learning SequeL - 24/40

27 SpectralUCB: Regret Bound Corollary: Determinant V T of V T = Λ + T t=1 x tx T t is maximized when all x t are aligned with axes. V T max ti =T (λi + t i ) ln V T Λ ln V T Λ ( max ln 1 + t i ti =T d i=1 ( ln 1 + T ) λ ( d ln 1 + T λ ( 2d ln 1 + T λ + λ i ) + T ) ) N i=d+1 λ d+1 ( ln 1 + t ) i λ d+1 Michal Valko Graphs in Machine Learning SequeL - 25/40

28 SpectralUCB: Improving the running time Reduced basis: We only need first few eigenvectors. Getting J eigenvectors: O(Jm log m) time for m edges Computationally less expensive, comparable performance. cumulative regret J = 20 J = 200 J = 2000 computational time in seconds time T 10 0 J = 20 J=200 J=2000 Michal Valko Graphs in Machine Learning SequeL - 26/40

29 SpectralUCB: How to make it even faster? UCB-style algorithms need to (re)-compute UCBs every t Can be a problem for large set of arms D 2 N N 3 Optimistic (UCB) approach vs. Thompson Sampling Play the arm maximizing probability of being the best Sample α from the distribution N ( α, v 2 V 1 ) Play arm which maximizes x T α and observe reward Compute posterior distribution according to reward received Only requires D 2 + DN N 2 per step update Michal Valko Graphs in Machine Learning SequeL - 27/40

30 Thomson Sampling: Estimate α ˆα Michal Valko Graphs in Machine Learning SequeL - 28/40

31 Thomson Sampling: Sample α α ˆα Michal Valko Graphs in Machine Learning SequeL - 28/40

32 Thomson Sampling: Estimate ˆα α Michal Valko Graphs in Machine Learning SequeL - 28/40

33 Thomson Sampling: Sample ˆα α α Michal Valko Graphs in Machine Learning SequeL - 28/40

34 Thomson Sampling: Estimate... ˆα α Michal Valko Graphs in Machine Learning SequeL - 28/40

35 SpectralTS for Graphs 1: Input: 2: N, T, {Λ L, Q}, λ, δ, R, C 3: Initialization: 4: v = R 6d log((λ + T )/δλ) + C 5: α = 0 N 6: f = 0 N 7: V = Λ L + λi N 8: Run: 9: for t = 1 to T do 10: Sample α N ( α, v 2 V 1 ) 11: π(t) arg max a xa T α 12: Observe a noisy reward r(t) = xπ(t)α T + ε t 13: f f + x π(t) r(t) 14: Update V V + x π(t) x T π(t) 15: Update α V 1 f 16: end for Michal Valko Graphs in Machine Learning SequeL - 29/40

36 SpectralTS: Regret bound d: Effective dimension. λ: Minimal eigenvalue of Λ = Λ L + λi. C: Smoothness upper bound, α Λ C. x T i α [ 1, 1] for all i. The cumulative regret R T of SpectralTS is with probability 1 δ bounded as R T 11g 4 + 4λ dt log λ + T + 1 p λ λ T + g ( ) T log 2 p λ δ, where p = 1/(4e π) and g = ( ( λ + T 4 log TN R 6d log δλ ) ) + C + R 2d log ( (λ + T )T 2 δλ ) + C. R T d T log N Michal Valko Graphs in Machine Learning SequeL - 30/40

37 SpectralTS: Analysis sketch Divide arms into two groups i = x T α x T i α g x i V 1 t i = x T α x T i α > g x i V 1 t Saturated arm arm i is unsaturated arm i is saturated Small standard deviation accurate regret estimate. High regret on playing the arm Low probability of picking Unsaturated arm Low regret bounded by a factor of standard deviation High probability of picking Michal Valko Graphs in Machine Learning SequeL - 31/40

38 SpectralTS: Analysis sketch Confidence ellipsoid for estimate µ of µ (with probability 1 δ/t 2 ) Using analysis of OFUL algorithm (Abbasi-Yadkori et al., 2011) ( ) ) (λ + T )T xi T α xi (R T α 2 2 d log + C x i δλ V 1 = l x i t V 1 t The key result coming from spectral properties of V t. log Vt Λ ( 2d log 1 + T ) λ Concentration of sample α around mean α (with probability 1 1/T 2 ) Using concentration inequality for Gaussian random variable. ( ( λ + T xi T α xi T α R 6d log δλ ) ) + C x i V 1 t 4 log(tn) = v xi V 1 4 log(tn) t Michal Valko Graphs in Machine Learning SequeL - 32/40

39 SpectralTS: Analysis sketch Define regret (t) = regret(t) 1{ xi T α(t) xt i α l x i V 1} t regret (t) 11g p x a(t) V t T 2 Super-martingale (i.e. E[Y t Y t 1 F t 1 ] 0) X t = regret (t) 11g p x a(t) V 1 t t Y t = X w. w=1 1 T 2 (Y t ; t = 0,..., T ) is a super-martingale process w.r.t. history F t. Azuma-Hoeffding inequality for super-martingales, w.p. 1 δ/2: T regret (t) 11g T x a(t) p V t T + g ( ) T ln 2 p λ δ t=1 t=1 Michal Valko Graphs in Machine Learning SequeL - 33/40

40 Spectral Bandits: Synthetic experiment cumulative regret Barabasi Albert N=250, basis size=3, effective d=1 SpectralTS LinearTS SpectralUCB LinUCB computational time in seconds time T SpectralTS LinearTS SpectralUCB LinUCB Michal Valko Graphs in Machine Learning SequeL - 34/40

41 Spectral Bandits: Real world experiment MovieLens dataset of 6k users who rated one million movies. cumulative regret Movielens data N=2019, average of 10 users, T=200, d = 5 SpectralUCB LinUCB SpectralTS LinearTS computational time in seconds time t SpectralTS LinearTS SpectralUCB LinUCB Michal Valko Graphs in Machine Learning SequeL - 35/40

42 Spectral Bandits Summary Spectral bandit setting (smooth graph functions). SpectralUCB Regret bound RT = Õ ( d ) T ln T SpectralTS ( Regret bound RT = Õ d ) T ln N Computationally more efficient. SpectralEliminator ( dt ) Regret bound RT = Õ ln T Better upper, empirically does not seem to work well (yet) Bounds scale with effective dimension d D. Michal Valko Graphs in Machine Learning SequeL - 36/40

43 SpectralEliminator: Pseudocode Input: N : the number of nodes, T : the number of pulls {Λ L, Q} spectral basis of L λ : regularization parameter β, {t j } J j parameters of the elimination and phases A 1 {x 1,..., x K }. for j = 1 to J do V tj γλ L + λi for t = t j to min(t j+1 1, T ) do Play x t A j with the largest width to observe r t : x t arg max x Aj x V 1 t V t+1 V t + x t xt T end for Eliminate the arms that are not promising: α t V 1 t { [x tj,..., x t ][r tj,..., r t ] T A j+1 x A j, α t, x + x V 1 t end for ]} β max x Aj [ α t, x x V 1β t Michal Valko Graphs in Machine Learning SequeL - 37/40

44 SpectralEliminator: Analysis SpectralEliminator Divide time into sets (t 1 = 1 t 2... ) to introduce independence for Azuma-Hoeffding inequality and observe R T J j=0 (t j+1 t j ) [ x x t, α j + ( x V 1 j Bound x x t, α j for each phase No bad arms: x x t, α j ( x V 1 j By algorithm: x 2 V 1 j t j s=t j 1 +1 min (1, x s 2 V 1 s 1 1 tj t j t j 1 ) log V j Λ + x t V 1)β j s=t j 1 +1 x s 2 V 1 + x t V 1)β ] j s 1 Michal Valko Graphs in Machine Learning SequeL - 38/40

45 Spectral Bandits: Is it possible to do better? Is d a good quantity that embodies the difficulty? Lower bound! For any d, we construct a graph that for any reasonable algorithm, the regret is at least Ω( dt ). How? By reduction to d-arm bandits problem. K MT K MT K MT weights ε K MT K MT K MT Michal Valko Graphs in Machine Learning SequeL - 39/40

46 SequeL Inria Lille MVA 2015/2016 Michal Valko sequel.lille.inria.fr

Graphs in Machine Learning

Graphs in Machine Learning Michal Valko INRIA Lille - Nord Europe, France Partially based on material by: Rob Fergus, Tomáš Kocák March 17, 2015 MVA 2014/2015 Last Lecture Analysis of online SSL Analysis