A Reinforcement Learning Approach for Collaborative Filtering



Jungkyu Lee 1, Byonghwa Oh 2, Jihoon Yang 2, and Sungyong Park 2

1 Cyram Inc, Seoul, Korea
jklee@cyram.com
2 Sogang University, Seoul, Korea
{mrfive,yangjh,parksy}@sogang.ac.kr

Abstract. In recent years, there has been much interest in recommender systems that provide users with personalized suggestions for products or services. In this paper we present a reinforcement learning approach for collaborative filtering. First, we formalize the collaborative filtering problem as a Markov Decision Process. Then, we learn the connection between the time sequence of user ratings using Q-learning. Experiments demonstrate the feasibility of our approach and a tight relationship between the past and current ratings.

Keywords: recommender systems, Markov decision process, Q-learning, singular value decomposition

1 Introduction

Recommender systems provide users with personalized suggestions for products or services. They can be combined with various technologies, including autonomous driving, to make appropriate suggestions in different spatiotemporal situations. Collaborative Filtering (CF) is one of the most active areas in recommender systems because it relies on historical data (e.g. user ratings) rather than on information about the content. The data of CF can be represented as a large sparse matrix with (user, item, rating) entries, and the goal is to predict the missing entries, that is, users' ratings for items they have not yet considered. Many researchers have developed CF algorithms, including the Restricted Boltzmann Machine (RBM) [8], K-Nearest Neighbor (KNN) [1] and Singular Value Decomposition (SVD) [5]. This paper presents a new CF algorithm based on reinforcement learning, RLCF. Reinforcement learning has been successful in many applications [3]. However, to the best of our knowledge, there has been no attempt to apply reinforcement learning to CF.
The motivation for applying reinforcement learning to CF comes from the contrast effect, which is the enhancement or diminishment of a perception and related performance as a result of immediately previous or simultaneous exposure to a stimulus in the same dimension with a smaller or greater value [10]. The contrast effect can take place in CF. For instance, a user can evaluate the same movie differently depending on what he watched previously. Thus, the prediction of a user's ratings for an unseen movie can be optimized by taking into account the sequence of movies he has seen. Against this background, we attempt to formulate the contrast effect on CF as a Markov Decision Process (MDP) and apply one of the popular reinforcement learning algorithms, Q-learning [9, 11], to it.

2 Preliminaries

2.1 Problem Definition

Suppose the user-movie data set consists of ratings (on a 5-star scale) for N users and M movies. Let $X \in \mathbb{R}^{N \times M}$ be the matrix of the ratings. Let $A \in \mathbb{R}^{N \times M}$ be a matrix that holds the real (correct) ratings for the (user, movie, rating) entries not in $X$. The performance of a CF algorithm can be measured by the error between the prediction values of the algorithm and $A$. Let $I \in \{0, 1\}^{N \times M}$ be an indicator matrix such that:

$$I_{ij} = \begin{cases} 1 & \text{if movie } j \text{ is rated by user } i \text{ in } A \\ 0 & \text{if the rating is missing in } A \end{cases} \tag{1}$$

Let $P \in \mathbb{R}^{N \times M}$ be the matrix of predictions made by the CF algorithm. The goal is to estimate the missing entries of $X$ such that the root mean squared error (RMSE) is minimal. RMSE is defined as follows:

$$\mathrm{RMSE}(P, A) = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij} (A_{ij} - P_{ij})^2}{\sum_{i=1}^{N} \sum_{j=1}^{M} I_{ij}}} \tag{2}$$

2.2 Problem Formulation as MDP

An MDP provides a mathematical framework for a sequential decision problem [2, 6], and has been used for a wide range of optimization problems. An MDP consists of a tuple of 5 elements $(S, A, \{P_{sa}\}, \gamma, R)$ where:

State S: The state $S$ is defined as the ratings of movies. If the ratings are integers from 1 to $K$, the size of the state space is $|S| = K$. $s_t^{(i)} \in S$ refers to the rating of the $t$-th movie that user $i$ watched.

Action A: As defined above, $s_t^{(i)}$ and $s_{t+1}^{(i)}$ refer to the ratings of the $t$-th and $(t+1)$-th movies that user $i$ watched, respectively. State $s_t^{(i)}$ transitions to $s_{t+1}^{(i)}$ by taking action $a_t^{(i)} \in A$. We represent this process as follows:

$$s_t^{(i)} \xrightarrow{a_t^{(i)}} s_{t+1}^{(i)} \tag{3}$$

Note that $A = S$.

Transition probability $P_{sa}$: We assume that the MDP for CF is deterministic. That is, if a user takes action $a_t^{(i)}$ at state $s_t^{(i)}$, the transition to the next state $s_{t+1}^{(i)}$ is not random. Since $A = S$ and the MDP is deterministic, we can see that $a_t^{(i)} = s_{t+1}^{(i)}$ in our problem.

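Discount factor $\gamma$: $0 < \gamma < 1$ is called the discount factor.

Reward r: When user $i$ takes action $a_t^{(i)}$ at state $s_t^{(i)}$, user $i$ receives a reward or penalty signal defined as follows:

$$r(s_t^{(i)}, a_t^{(i)}) = s_{t+2}^{(i)} - \mathrm{predictor}(i, t) \tag{4}$$

Here $\mathrm{predictor}(i, t)$ is the prediction that a CF algorithm estimates and $s_{t+2}^{(i)}$ is the next-next state. The rationale behind (4) is described in Section 3.2.

2.3 Generating Episodes

As mentioned earlier, $X \in \mathbb{R}^{N \times M}$ consists of (user, movie, rating) entries. If we use CF data with (user, movie, rating, order) entries, we can generate the episodes as follows:

$$\begin{aligned}
& s_1^{(1)} \xrightarrow{a_1^{(1)}} s_2^{(1)} \xrightarrow{a_2^{(1)}} s_3^{(1)} \xrightarrow{a_3^{(1)}} \cdots \xrightarrow{a_{T_1 - 1}^{(1)}} s_{T_1}^{(1)} \\
& s_1^{(2)} \xrightarrow{a_1^{(2)}} s_2^{(2)} \xrightarrow{a_2^{(2)}} s_3^{(2)} \xrightarrow{a_3^{(2)}} \cdots \xrightarrow{a_{T_2 - 1}^{(2)}} s_{T_2}^{(2)} \\
& \qquad \vdots \\
& s_1^{(N)} \xrightarrow{a_1^{(N)}} s_2^{(N)} \xrightarrow{a_2^{(N)}} s_3^{(N)} \xrightarrow{a_3^{(N)}} \cdots \xrightarrow{a_{T_N - 1}^{(N)}} s_{T_N}^{(N)}
\end{aligned}$$

where $s_j^{(i)}$ is the rating that user $i$ gives to the $j$-th movie he watched and $T_i$ is the number of movies user $i$ has rated. For instance, suppose the (user, movie, rating, order) entries of the user-movie data are as in Table 1, where each cell holds (rating, order); the entry (1, 1, 2, 1st) means that user 1 watched movie 1 first and gave it 2 stars.

Table 1: Example of (user, movie, rating, order) data

          movie 1    movie 2    movie 3    movie 4    movie 5
user 1    (2, 1st)   missing    (5, 2nd)   missing    (2, 3rd)
user 2    (1, 1st)   (1, 2nd)   missing    (3, 3rd)   missing
user 3    (5, 1st)   missing    missing    (3, 2nd)   (5, 3rd)
user 4    (1, 1st)   missing    missing    missing    (4, 2nd)
user 5    (2, 1st)   (4, 2nd)   (5, 3rd)   (5, 4th)   missing

Reading each user's ratings in watch order, and recalling that $a_t^{(i)} = s_{t+1}^{(i)}$, the episodes are generated as follows:

$$\begin{aligned}
\text{user 1:} \quad & 2 \xrightarrow{5} 5 \xrightarrow{2} 2 \\
\text{user 2:} \quad & 1 \xrightarrow{1} 1 \xrightarrow{3} 3 \\
\text{user 3:} \quad & 5 \xrightarrow{3} 3 \xrightarrow{5} 5 \\
\text{user 4:} \quad & 1 \xrightarrow{4} 4 \\
\text{user 5:} \quad & 2 \xrightarrow{4} 4 \xrightarrow{5} 5 \xrightarrow{5} 5
\end{aligned}$$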
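The episode construction of Section 2.3 can be illustrated with a short sketch. It is only one possible reading of the procedure: the names `TABLE_1`, `generate_episodes` and `transitions` are hypothetical, and the entries are the (user, movie, rating, order) tuples of Table 1.

```python
from collections import defaultdict

# (user, movie, rating, order) entries taken from Table 1.
TABLE_1 = [
    (1, 1, 2, 1), (1, 3, 5, 2), (1, 5, 2, 3),
    (2, 1, 1, 1), (2, 2, 1, 2), (2, 4, 3, 3),
    (3, 1, 5, 1), (3, 4, 3, 2), (3, 5, 5, 3),
    (4, 1, 1, 1), (4, 5, 4, 2),
    (5, 1, 2, 1), (5, 2, 4, 2), (5, 3, 5, 3), (5, 4, 5, 4),
]


def generate_episodes(entries):
    """Return {user: [s_1, s_2, ...]}, the user's ratings in watch order."""
    by_user = defaultdict(list)
    for user, _movie, rating, order in entries:
        by_user[user].append((order, rating))
    # Sorting the (order, rating) pairs puts the ratings in watch order.
    return {user: [rating for _, rating in sorted(pairs)]
            for user, pairs in by_user.items()}


def transitions(states):
    """Yield (s_t, a_t, s_{t+1}) triples.

    Because A = S and the MDP is deterministic, a_t equals s_{t+1}.
    """
    for s, s_next in zip(states, states[1:]):
        yield s, s_next, s_next


episodes = generate_episodes(TABLE_1)
user_1_steps = list(transitions(episodes[1]))
```

Keeping each episode as a plain list of ratings is a design choice of this sketch: since every action equals the next state, the actions never need to be stored separately.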

3 RLCF

RLCF consists of two phases: training and prediction. The former computes the Q-table of the MDP, and the latter predicts the rating for a user-movie pair using both the output of a CF algorithm and the value of the Q-table.

3.1 Training

Based on the MDP formulation and preliminaries, we learn the Q function using Q-learning, which can be summarized as (see [9, 11] for a detailed derivation):

$$Q(s, a) = Q(s, a) + \alpha \left[ r(s, a) + \gamma \max_{a'} Q(\delta(s, a), a') - Q(s, a) \right] \tag{5}$$

where $Q(s, a)$ is the old estimated sum of rewards, $r(s, a) + \gamma \max_{a'} Q(\delta(s, a), a')$ is the new estimated sum of rewards after a step forward, $\delta(s, a)$ is the state reached from $s$ by action $a$, and $\alpha$ is the learning rate. Since the number of possible actions in state $s$ is $|S|$, $Q(s, a)$ is a $|S| \times |S|$ table. The episodes generated as described in Section 2.3 are used as training data for Q-learning. Here is the training algorithm:

Algorithm 1: Training Algorithm in RLCF
Input: α: learning rate
       γ: discount factor
       T_i: the number of movies user i has rated
Output: Q(s, a): the estimated sum of rewards
Initialize Q(s, a) = 0 for all s ∈ S, a ∈ A;
Convert X ∈ R^{N×M} to a training episode set as in Section 2.3;
for each user i = 1 : N do
    for each movie j = 1 : T_i − 2 do
        Calculate the reward: r(s_j^{(i)}, a_j^{(i)}) = s_{j+2}^{(i)} − predictor(i, j);
        Update the Q-function Q(s, a) using eq. (5);
    end for
end for

The inner loop stops at $T_i - 2$ because the reward needs the rating $s_{j+2}^{(i)}$.

3.2 Prediction

The prediction $p_j^{(i)}$ of the entry $x_{ij} \in X$ that user $i$ gives for movie $j$ is calculated as follows:

$$p_j^{(i)} = \mathrm{predictor}(i, j) + Q(s_{j-2}^{(i)}, a_{j-2}^{(i)}) \tag{6}$$

The reason why eq. (4) uses the next-next state $s_{t+2}^{(i)}$, not the next state $s_{t+1}^{(i)}$, in the reward function $r(s_t^{(i)}, a_t^{(i)})$ is related to eq. (6). If we used the next state $s_{t+1}^{(i)}$, the prediction $p_j^{(i)}$ would have to be defined as

$$p_j^{(i)} = \mathrm{predictor}(i, j) + Q(s_{j-1}^{(i)}, a_{j-1}^{(i)}) \tag{7}$$

However, it is impossible to know $Q(s_{j-1}^{(i)}, a_{j-1}^{(i)}) = Q(s_{j-1}^{(i)}, s_j^{(i)})$ when we try to predict $p_j^{(i)}$, because the parameter $s_j^{(i)}$ is exactly the rating we want to predict.

4 Experiment

4.1 Data set

We adopted the MovieLens data set, which contains 10,000,054 ratings applied to 10,681 movies by 71,567 users of the online movie recommender service [7]. Users who rated at least 20 movies were chosen and their ratings were sorted chronologically. The most recent 20% of the movie ratings in the entire data set are used for testing, and the remaining 80% are used for training.

4.2 Base Predictor

As base predictors for CF, we adopted SVD and SVD++, an improvement over standard SVD that incorporates implicit feedback, implemented by stochastic gradient descent (see [4] for a detailed description of the algorithm). In addition, we added a simple movie means (MM) algorithm that outputs the mean of the ratings of each movie.

4.3 Results

The state $S$ is defined with ratings in [0.5, 5.0] at 0.5 intervals, making $|S| = 10$, so $Q(s, a)$ is a $10 \times 10$ table. We compared the RMSE of RLCF combined with each base predictor with that of the pure base predictor. We trained the SVD and SVD++ predictors with 10 latent features. There are two parameters in RLCF to be tuned: the learning rate $\alpha$ and the discount factor $\gamma$ of Q-learning. Table 2 shows the experimental results for each predictor with various parameter settings.

Table 2: The RMSE of RLCF

Algorithm   Base   RLCF   Improvement   γ   α
MM
SVD
SVD++

As expected, SVD++ produced the best performance while MM produced the worst. RLCF improved the performance regardless of the base predictor: MM, SVD, and SVD++ combined with RLCF each achieved a lower RMSE than the corresponding pure predictor. This supports our idea of incorporating the contrast effect via reinforcement learning.
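To make the evaluated pipeline concrete, the following sketch puts together the Q-learning update of eq. (5), the prediction rule of eq. (6) and an RMSE over held-out ratings. It is an illustration under stated assumptions, not the authors' implementation. The function names, the toy data and the values of `alpha` and `gamma` are hypothetical. The base prediction used in the reward is taken to be the one for the movie whose rating is $s_{t+2}$, which is one reading of eq. (4). Returning the plain base prediction when fewer than two earlier ratings exist is also an assumption.

```python
import math


def init_q(ratings):
    """Q-table with Q(s, a) = 0 for every state s and action a (A = S)."""
    return {s: {a: 0.0 for a in ratings} for s in ratings}


def train(q, episodes, base_preds, alpha, gamma):
    """Tabular Q-learning over rating episodes.

    episodes[u]   : ratings of user u in watch order (states s_1 .. s_T)
    base_preds[u] : base-predictor outputs aligned with episodes[u]
                    (assumption: base_preds[u][k] predicts episodes[u][k])
    """
    for user, states in episodes.items():
        preds = base_preds[user]
        for t in range(len(states) - 2):
            s, a = states[t], states[t + 1]  # a_t = s_{t+1}
            s_next = a                       # deterministic transition
            # Residual of the base predictor on the next-next rating.
            reward = states[t + 2] - preds[t + 2]
            target = reward + gamma * max(q[s_next].values())
            q[s][a] += alpha * (target - q[s][a])
    return q


def predict(q, history, base_pred):
    """Base prediction corrected by Q(s_{j-2}, a_{j-2}), a_{j-2} = s_{j-1}."""
    if len(history) < 2:
        return base_pred  # assumption: no correction without two earlier ratings
    return base_pred + q[history[-2]][history[-1]]


def rmse(actual, predicted):
    """Root mean squared error over the rated (observed) entries only."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Toy run: one user with ratings 2, 5, 2 and a constant base prediction of 3.0.
q = train(init_q([1, 2, 3, 4, 5]), {1: [2, 5, 2]}, {1: [3.0, 3.0, 3.0]},
          alpha=0.5, gamma=0.5)
corrected = predict(q, [2, 5], 3.0)
error = rmse([2.0], [corrected])
```

In the toy run the only update is for the pair (2, 5): the base predictor overshoots the third rating by 1, so the learned correction is negative and the corrected prediction moves toward the true rating.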

5 Conclusion

We presented a new reinforcement learning based algorithm, RLCF, for CF. We first formalized CF as an MDP to model the effect that the sequence of a user's previous ratings has on current ratings. Then, the effect is learned and recorded in a Q-table via Q-learning. The experimental results show that the order of past ratings affects the current ratings significantly, and is thus useful for obtaining more accurate predictions.

There are some interesting directions for future work. When we formalize CF as an MDP, states can be defined by other factors as well as ratings. For example, the genre, season, and diverse tags can be defined as states in the MDP. If the movie genre is used, RLCF can discover the effect that influences ratings for action movies when a user watches an action movie after having watched a comedy. In addition, RLCF can be combined with other emerging technologies to provide relevant information to the user. For instance, it can be implemented in autonomous vehicles to make suggestions based on the driver's spatiotemporal context.

Acknowledgments

This work was supported by the Sogang University Research Grant of 2012 ( ).

References

1. Bell, R., Koren, Y.: Improved neighborhood-based collaborative filtering. In: Proc. KDD-Cup and Workshop at the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Citeseer (2007)
2. Bellman, R.: A Markovian decision process. (1957)
3. Kaelbling, L., Littman, M., Moore, A.: Reinforcement learning: A survey. Arxiv preprint cs/ (1996)
4. Koren, Y.: Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. pp ACM (2008)
5. Paterek, A.: Improving regularized singular value decomposition for collaborative filtering. In: Proceedings of KDD Cup and Workshop. vol Citeseer (2007)
6. Puterman, M.: Markov decision processes. Handbooks in Operations Research and Management Science 2, (1990)
7. Riedl, J., Konstan, J.: Movielens dataset (1998)
8. Salakhutdinov, R., Mnih, A., Hinton, G.: Restricted Boltzmann machines for collaborative filtering. In: Proceedings of the 24th international conference on Machine learning. p ACM (2007)
9. Sutton, R., Barto, A.: Reinforcement learning: An introduction. The MIT press (1998)
10. Thurstone, L.: A law of comparative judgment. Psychological review 34(4), (1927)
11. Watkins, C., Dayan, P.: Q-learning. Machine learning 8(3), (1992)
