Incorporating Heterogeneous Information for Personalized Tag Recommendation in Social Tagging Systems


Incorporating Heterogeneous Information for Personalized Tag Recommendation in Social Tagging Systems

Wei Feng, Tsinghua University, Beijing, China
Jianyong Wang, Tsinghua University, Beijing, China

ABSTRACT

A social tagging system provides users an effective way to collaboratively annotate and organize items with their own tags. A social tagging system contains heterogeneous information like users' tagging behaviors, social networks, tag semantics and item profiles. All the heterogeneous information helps alleviate the cold start problem due to data sparsity. In this paper, we model a social tagging system as a multi-type graph. To learn the weights of different types of nodes and edges, we propose an optimization framework, called OptRank. OptRank can be characterized as follows: (1) Edges and nodes are represented by features, and different types of edges and nodes have different sets of features. (2) OptRank learns the best feature weights by maximizing the average AUC (Area Under the ROC Curve) of the tag recommender. We conducted experiments on two publicly available datasets, i.e., Delicious and Last.fm. Experimental results show that: (1) OptRank outperforms the existing graph-based methods when only the <user, tag, item> relation is available. (2) OptRank successfully improves the results by incorporating the social network, tag semantics and item profiles.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information Filtering, Retrieval Models, Selection Process

General Terms: Algorithms

Keywords: Recommender System, Social Tagging System

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD'12, August 12-16, 2012, Beijing, China. Copyright 2012 ACM.

1. INTRODUCTION

In social tagging systems, users can annotate and organize items with their own tags for future search and sharing. For example, users can annotate and share Web pages in Delicious. Besides Delicious, there are many other social tagging systems like Last.fm and YouTube in the entertainment domain and CiteULike in the research domain.

Figure 1: A social tagging system, containing users u_1, ..., u_|U|, tags t_1, ..., t_|T|, and items i_1, ..., i_|I|.

Personalized tag recommendation is a key part of a social tagging system. When a user wants to annotate an item, the user may have his/her own vocabulary to organize items. Personalized tag recommendation tries to find the tags that precisely describe the item with the user's vocabulary. A social tagging system, as shown in Figure 1, contains heterogeneous information and can be modeled as a graph:

Users (U), tags (T) and items (I) co-exist in the graph.

Inter-relation. Edges between users, tags and items can be derived from the annotation behaviors <user, tag, item>. Suppose we have u ∈ U and t ∈ T; the weight of <u, t> is the number of times tag t has been used by user u. The same rule applies to <u, i> and <i, t>, where i ∈ I.

Intra-relation. (1) The social network among users. (2) The tag semantic network based on semantic relatedness. (3) The item network based on content similarities.

While the inter-relation has been well studied in previous work [4, 5, 12, 13, 16, 17], little work tries to incorporate all the intra-relations into a unified model.
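To make the inter-relation construction concrete, the following sketch (Python; our own illustration rather than the authors' code, using hypothetical toy data) derives the <user, tag>, <user, item> and <item, tag> co-occurrence weights from a list of posts:

    from collections import Counter

    # posts: <user, tag, item> annotation triples (toy, assumed data).
    posts = [("u1", "python", "page1"), ("u1", "web", "page1"),
             ("u2", "python", "page2")]

    # Project the ternary relation onto each pair of dimensions; the edge
    # weight is the number of co-occurrences in the posts.
    w_ut = Counter((u, t) for u, t, i in posts)  # times tag t used by user u
    w_ui = Counter((u, i) for u, t, i in posts)  # times item i annotated by u
    w_it = Counter((i, t) for u, t, i in posts)  # times item i annotated with t

    print(w_ut[("u1", "python")])  # -> 1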
Incorporating the intra-relations may alleviate the cold start problem due to data sparsity: users in a social network may influence each other by sharing annotated items; semantically related tags may co-occur to describe an item; and items that have similar contents may be annotated with the same tags.

When a user u wants to annotate an item i, the recommended tags should meet two requirements: (1) they should be highly relevant to user u, because users have their own ways to organize items; (2) they should be highly relevant to item i, because tags should precisely describe the item.

To rank the tags, we can perform a random walk that restarts at user u and item i to assign each tag a visiting probability, which is used as the ranking score. Only tags that are relevant to both u and i can get high scores. However, two problems arise when the random walk is performed on the multi-type graph:

(1) Different types of edges have different meanings and thus are measured in different metrics. For example, the edge weights of a social network may be binary, and they have completely different meanings from other types of edges, such as the edges formed by the tagging behaviors <user, tag, item>. To perform a random walk, they need to be measured under the same metric.

(2) The random walker can restart either from the user u or from the item i. The probabilities of restarting at u and at i should be estimated.

To solve the above two problems, we propose an optimization framework called OptRank. OptRank can be characterized as follows. Edges are represented by features, and different types of edges have different sets of features. For example, an edge <u_1, u_2> (u_1, u_2 ∈ U) in the social network is represented by the feature set {the number of common tags, the number of common items}, while an edge <t, u> (u ∈ U, t ∈ T) is represented by the feature {the number of times t has been used by u}. Each feature has a feature weight, and the edge weight is decided by both the features and the feature weights. The user u and the item i given for recommendation are represented by a constant feature, but their feature weights are learned separately. OptRank learns the feature weights by maximizing the average AUC (Area Under the ROC Curve) of the tag recommender.

Although graph-based methods have been studied in the field of personalized tag recommendation by many researchers [4, 5, 17], most of them belong to the unsupervised approach, in which the edge weights and the restart probabilities at u and i are empirically assigned. Inspired by the recent development of semi-supervised learning [3] and graph-based learning [1], we are able to turn the existing unsupervised graph-based methods into supervised ones. More specifically, we extend the supervised random walk proposed in [1] for link prediction to the setting of personalized tag recommendation. This paper has two major differences from [1]: (1) The graph in our setting contains different types of edges; each type has its own set of features, and the corresponding feature weights are learned separately. (2) Since we have two nodes for restart, we further introduce node features.

To summarize, our contributions are as follows: (1) To solve the cold start problem due to data sparsity, we are among the first to explore three additional relations: the social network, tag semantic relatedness and item content similarities. (2) We propose a graph model and extend the random walk with restart to the multi-type graph to handle different types of relations uniformly. (3) We propose an optimization framework to learn the best edge weights and node weights by maximizing the average AUC of the tag recommender.

The remainder of this paper is organized as follows. The problem we address is formulated in Section 2. The graph model and the random walk with restart are introduced in Section 3. Our optimization framework OptRank is introduced in Section 4. The experimental study is described in Section 5. Related work is introduced in Section 6. We conclude the paper and discuss future work in Section 7.

2. PROBLEM STATEMENT AND BASIC FRAMEWORK

Personalized Tag Recommendation.
Given a user u and an item i, personalized tag recommendation tries to find tags that describe or classify the item i precisely according to u's vocabulary. Inter-relations and intra-relations among users, items and tags are considered, which makes the graph a multi-type graph as shown in Figure 1. Highly ranked tags should be relevant to both u and i. To achieve this goal, a random walk is performed on the multi-type graph with restart at user u and item i. Only tags that are near to both u and i can get a high visiting probability. Formally, the random walk with restart is performed according to the following equation:

(p_U; p_T; p_I)^{t+1} = (1 - α) Â (p_U; p_T; p_I)^{t} + α (q̂_U; 0; q̂_I)    (1)

where α is the restart probability; with probability 1 - α, the random walker performs a random jump based on his current state. p = (p_U; p_T; p_I) is the vector of visiting probabilities of all nodes, and p_T contains the ranking scores of the tags. Â is the transition matrix that stores the graph structure information; Â is obtained by normalizing each column of the adjacency matrix A to sum to 1. q̂ = (q̂_U; 0; q̂_I) is the preference vector that contains the restart probability of each node; q̂ is obtained by normalizing the node weight vector q to sum to 1. The transition matrix Â and the preference vector q̂ will be introduced in detail in Section 3.

Optimization Framework. To get a good ranking from Equation 1, the transition matrix Â and the preference vector q̂ need to be carefully assigned. Thus we develop an optimization framework called OptRank. Given a user u and an item i for personalized tag recommendation, suppose u has finally annotated i with tags t_1, t_2, ..., t_k. These tags are defined to be positive tags, denoted by PT. The remaining tags are defined to be negative tags, denoted by NT. In other words, the whole tag set T is divided into two parts, i.e., T = PT ∪ NT. A good ranking function defined by Equation 1 should rank all the positive tags higher than the negative tags: for a randomly picked positive tag t_1 and negative tag t_2, a good ranking function has a high probability of ranking t_1 higher than t_2. This is the idea of the AUC (Area Under the ROC Curve) metric. Formally, the AUC is defined by the following equation:

AUC = ( Σ_{i ∈ PT} Σ_{j ∈ NT} I(p_T(i) - p_T(j)) ) / ( |PT| · |NT| )    (2)
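For concreteness, Equation 1 and the AUC of Equation 2 can be sketched as follows (Python/NumPy; a minimal illustration of ours, assuming a column-stochastic transition matrix hat_A, a normalized preference vector hat_q, and index lists pos/neg for the positive and negative tags):

    import numpy as np

    def rwr(hat_A, hat_q, alpha, tol=1e-10, max_iter=1000):
        # Equation 1: p = (1 - alpha) * hat_A @ p + alpha * hat_q
        p = hat_q.copy()
        for _ in range(max_iter):
            p_next = (1 - alpha) * hat_A @ p + alpha * hat_q
            if np.abs(p_next - p).sum() < tol:
                break
            p = p_next
        return p_next

    def auc(p_tag, pos, neg):
        # Equation 2: fraction of (positive, negative) pairs ranked correctly.
        correct = sum(p_tag[i] > p_tag[j] for i in pos for j in neg)
        return correct / (len(pos) * len(neg))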

where I(x) is 1 when x > 0 and 0 otherwise. Our goal is to find the transition matrix Â and the preference vector q̂ that maximize the AUC. To achieve this, edges are represented by features X, and the nodes u and i are represented by features Y. To better illustrate the idea, assume for the moment that the adjacency matrix A only contains edges of a single type; A with different types of edges will be introduced in Section 3.1. Each edge <v, u> (u, v ∈ U ∪ T ∪ I) is represented by a feature vector X(u, v). Let θ denote the vector of feature weights; the edge weight A(u, v) is computed as A(u, v) = f_edge(θᵀ X(u, v)), where f_edge: R → R⁺. User u and item i are respectively represented by the feature vectors Y_U = (1) and Y_I = (1). (Nodes are allowed to have more than one feature, so Y_U and Y_I are still written as vectors.) Let ξ denote the feature weights. The node weights q_U(u) and q_I(i) are computed as q_U(u) = f_node(ξ_Uᵀ Y_U) and q_I(i) = f_node(ξ_Iᵀ Y_I), where f_node: R → R⁺. All other entries of q_U and q_I are 0.

According to the above representation, the transition matrix Â and the adjacency matrix A can be rewritten as Â(θ) and A(θ), and q̂ and q can be rewritten as q̂(ξ) and q(ξ); that is, they are respectively decided by the parameters θ and ξ. Since the random walk is defined by Â(θ) and q̂(ξ) according to Equation 1, p can be rewritten as p(θ, ξ), which means the final ranking scores are parameterized by θ and ξ. To keep the following formulae uncluttered, however, we will not spell out these parameters in the notation.

With edges and nodes parameterized by θ and ξ, we give a formal description of our optimization framework. Given a user u and an item i for tag recommendation and the positive tag set, the optimization problem is

max_{θ,ξ} AUC(θ, ξ) = ( Σ_{i ∈ PT} Σ_{j ∈ NT} I(p_T(i) - p_T(j)) ) / ( |PT| · |NT| )

However, the above equation only considers a single training instance. When m instances {<u_k, i_k, PT_k>}_{k=1}^{m} are considered, the cost function J(θ, ξ) is defined as the average AUC:

max_{θ,ξ} J(θ, ξ) = (1/m) Σ_{k=1}^{m} ( Σ_{i ∈ PT_k} Σ_{j ∈ NT_k} I(p_T(i) - p_T(j)) ) / ( |PT_k| · |NT_k| )    (3)

where NT_k = T - PT_k. The optimization framework OptRank and its solution will be introduced in Section 4.

3. GRAPH MODEL

Before introducing the optimization problem, we first give more details about Equation 1. Section 3.1 introduces the transition matrix. Section 3.2 describes the preference vector. Section 3.3 gives more intuitions and details of the random walk with restart.

3.1 Transition Matrix

The transition matrix stores the graph structure information. Before defining it, we first introduce how to construct a graph from a social tagging system. The graph shown in Figure 1 is constructed in three steps: (1) Users, tags and items are mapped to nodes. (2) All the binary relations, i.e., the social network, tag semantic relatedness and item content similarities, are mapped to edges. (3) For the ternary relation <user, tag, item>, where three nodes are involved, binary relations can be derived by projections on each dimension. For example, given <u, t, i> (u ∈ U, t ∈ T, i ∈ I), the relation <i, t> can be derived by projecting out the user dimension; <i, t> is described by the feature "the number of times i has been annotated with t".

Now we define the adjacency matrix. Let G denote the whole graph as shown in Figure 1 and A denote its adjacency matrix. Let G_MN (M, N ∈ {U, T, I}) denote the sub-graph made up of the relations <m, n> (m ∈ M, n ∈ N), and let A_MN denote its adjacency matrix. We have G = ∪_{M,N ∈ {U,T,I}} G_MN, and A is composed of the sub-matrices A_MN:

A = ( A_UU  A_UT  A_UI
      A_TU  A_TT  A_TI
      A_IU  A_IT  A_II )    (4)

Recall that edges are represented by features.
In Section 2, the edge feature set is denoted by X and the feature weights by θ. Since different types of edges have different features and feature weights, we have X = {X_MN | M, N ∈ {U, T, I}} and θ = {θ_MN | M, N ∈ {U, T, I}}. Given an edge <m, n> (m ∈ M, n ∈ N), A_MN(m, n) is defined by

A_MN(m, n) = f_edge(θ_MNᵀ X_MN(m, n))    (5)

Note that X_MN(m, n) is a vector, so X_MN is a three-dimensional array. In this paper, f_edge: R → R⁺ is the sigmoid function:

f_edge(x) = 1 / (1 + e^{-x})    (6)

The transition matrix Â is obtained by normalizing each column of A:

Â = ( A_UU D_U^{-1}  A_UT D_T^{-1}  A_UI D_I^{-1}
      A_TU D_U^{-1}  A_TT D_T^{-1}  A_TI D_I^{-1}
      A_IU D_U^{-1}  A_IT D_T^{-1}  A_II D_I^{-1} )    (7)

where D_U, D_T and D_I are diagonal matrices. The i-th entry on the diagonal of D_U is the out-degree of the i-th user. For u ∈ U, we have

D_U(u, u) = Σ_{M ∈ {U,T,I}} Σ_{k=1}^{|M|} A_MU(k, u)    (8)

D_T and D_I are defined in the same way. Following this definition, each column of Â is normalized to sum to 1.

3.2 Preference Vector

Given a user u and an item i for tag recommendation, the preference vector q̂ = (q̂_U; 0; q̂_I) specifies the restart probabilities at u and i. As introduced in Section 2, user u and item i are respectively represented by the feature vectors Y_U = (1) and Y_I = (1). Let ξ = {ξ_U, ξ_I} denote the feature weights. The node weight q_M(m) (M ∈ {U, I}, m ∈ {u, i}) is computed as

q_M(m) = f_node(ξ_Mᵀ Y_M)    (9)

The other entries of q_U and q_I are all set to 0. In this paper, f_node: R → R⁺ is the sigmoid function:

f_node(x) = 1 / (1 + e^{-x})    (10)
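A minimal sketch of Equations 5-10 (Python/NumPy; our own illustration for a toy graph with a single edge type, where X[m, n, :] is assumed to hold the feature vector of edge <m, n>):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def adjacency(X, theta):
        # Equations 5 and 6: A(m, n) = f_edge(theta^T X(m, n)), entrywise.
        return sigmoid(X @ theta)

    def column_normalize(A):
        # Equations 7 and 8: divide each column by its sum (the out-degree).
        D = A.sum(axis=0)
        return A / np.where(D > 0, D, 1.0)

    def node_weights(n_nodes, u, i, xi_u, xi_i):
        # Equations 9 and 10 with the constant features Y_U = Y_I = (1);
        # the final normalization to sum 1 anticipates Equations 11 and 12.
        q = np.zeros(n_nodes)
        q[u] = sigmoid(np.dot(xi_u, np.ones(1)))
        q[i] = sigmoid(np.dot(xi_i, np.ones(1)))
        return q / q.sum()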

Figure 2: The random walker restarts at u_1 and i_1 within no more than 2 hops (an example graph with users u_1, u_2, u_3, tags t_1, ..., t_4 and items i_1, i_2, i_3).

The preference vector q̂ = (q̂_U; 0; q̂_I) is obtained by normalizing q = (q_U; 0; q_I) to sum to 1:

q̂ = ( q_U / D_q ; 0 ; q_I / D_q )    (11)

where D_q is the sum of all the entries in q_U and q_I. Formally, D_q is defined by the following equation:

D_q = Σ_{M ∈ {U,I}} Σ_{k=1}^{|M|} q_M(k)    (12)

Equation 11 ensures that q̂ sums to 1.

3.3 Random Walk With Restart

In this section, we give more intuitions about the random walk with restart for personalized tag recommendation. As introduced in Section 2, the random walker frequently restarts at u and i to rank the tags. We illustrate this idea with the example shown in Figure 2, where we want to recommend tags for user u_1 to annotate i_1, so the random walker restarts frequently from u_1 and i_1. The edges indicate how the random walker jumps from node to node. u_1 has annotated i_2 before, and i_1 has been annotated by u_3. Besides the annotation relation, u_2 is a friend of u_1, i_3 has similar contents to i_1, and t_4 has high semantic relatedness with t_1. Now we discuss how the random walker behaves within no more than two hops from u_1 and i_1.

When the random walker is only allowed to jump one hop from user u_1 and item i_1, the recommended tags either have been used by user u_1 or have been annotated on item i_1 by other users. As we can see from Figure 2, t_1 is such a tag. When u_1 has annotated many items and i_1 has been annotated by many users, the random walker will find the best common tags between u_1's tags and i_1's tags.

When the random walker is allowed to jump within two hops, the recommended tags come from different sources: (1) Items annotated by u_1. For example, i_2 has been annotated by u_1 and i_2 has a tag t_2; t_2 may reflect the interests of u_1. (2) Users that have annotated item i_1. Since u_3 has annotated i_1, the tags used by u_3 may reflect the content of i_1. (3) Friends of u_1. u_2 is a friend of u_1, and his/her tags may also be adopted by u_1. (4) Similar items. Since i_3 and i_1 have similar content, the tags of i_3 may also be the tags of i_1. (5) Semantically related tags. t_4 and t_1 are semantically related, which means that they may co-occur in annotations.

When data is sparse, i.e., u and i are both inactive, more information can be taken into account by jumping more than two hops away.

Now we introduce another intuition behind the random walk. With the transition matrix Â defined by Equation 7, we can rewrite Equation 1 as follows:

p_U = (1 - α) (Â_UU p_U + Â_UT p_T + Â_UI p_I) + α q̂_U    (13)

p_T = (1 - α) (Â_TU p_U + Â_TT p_T + Â_TI p_I)    (14)

p_I = (1 - α) (Â_IU p_U + Â_IT p_T + Â_II p_I) + α q̂_I    (15)

where Â_MN = A_MN D_N^{-1} (M, N ∈ {U, T, I}). The term Â_MN p_N means that p_N is spread to its neighboring nodes through the transition matrix Â_MN.

First we discuss the extreme case where α equals 0. Taking p_T as an example, p_T receives scores from p_U through Â_TU, from p_T through Â_TT and from p_I through Â_TI. For t ∈ T, p_T(t) will have a high score if t has highly ranked user neighbors, tag neighbors and item neighbors. The same rule applies to p_U and p_I. In other words, users, tags and items reinforce each other iteratively until a stable state is reached. However, no personalized information is considered in this case. Given a user u and an item i for tag recommendation, when α is greater than 0, the random walker restarts at u and i. Besides the reinforcement rule, p_U, p_T and p_I are then also influenced by the distance from u and i: nodes that are near to u and i get a higher ranking.
4. OPTIMIZATION BASED FRAMEWORK

In this section, we focus on how to find the best feature weights to achieve an optimal random walk with restart. Section 4.1 describes the objective function. Section 4.2 introduces how to solve the optimization problem. Section 4.3 derives the derivatives of the random walk with respect to the feature weights, which are the key detail in solving the optimization problem.

4.1 Objective Function

As introduced in Section 2, we want to maximize the average AUC of the tag recommender according to Equation 3. To convert this into a minimization problem, we rewrite Equation 2 in an equivalent form:

AUC = 1 - ( Σ_{i ∈ PT} Σ_{j ∈ NT} I(p_T(j) - p_T(i)) ) / ( |PT| · |NT| )    (16)

This equation tells us that maximizing the AUC is equivalent to minimizing Σ_{i ∈ PT} Σ_{j ∈ NT} I(p_T(j) - p_T(i)) / (|PT| · |NT|). We therefore propose a minimization problem equivalent to Equation 3:

min_{θ,ξ} J(θ, ξ) = (1/m) Σ_{k=1}^{m} ( Σ_{i ∈ PT_k} Σ_{j ∈ NT_k} I(p_T(j) - p_T(i)) ) / ( |PT_k| · |NT_k| )    (17)

Since J(θ, ξ) is not differentiable, we use the sigmoid function with a parameter β as a differentiable approximation of the indicator:

S(x; β) = 1 / (1 + e^{-βx})    (18)

The bigger β is, the smaller the approximation error is. However, when β is big, the steep gradient causes numerical problems, so β is assigned empirically.
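To illustrate the surrogate (a minimal sketch of ours, not the authors' implementation, assuming a score vector p_tag indexed by tag and index lists pos/neg for PT and NT):

    import numpy as np

    def S(x, beta):
        # Equation 18: sigmoid surrogate for the indicator I(x > 0).
        return 1.0 / (1.0 + np.exp(-beta * x))

    def smoothed_loss(p_tag, pos, neg, beta):
        # Smoothed count of mis-ordered (positive, negative) tag pairs,
        # averaged over the |PT| * |NT| pairs.
        deltas = np.array([p_tag[j] - p_tag[i] for i in pos for j in neg])
        return S(deltas, beta).mean()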

Now we have a new objective function:

min_{θ,ξ} J(θ, ξ) = (1/m) Σ_{k=1}^{m} ( Σ_{i ∈ PT_k} Σ_{j ∈ NT_k} S(p_T(j) - p_T(i); β) ) / ( |PT_k| · |NT_k| )    (19)

4.2 Solving the Optimization Problem

We use gradient descent to solve the optimization problem. The basic idea of gradient descent is to find the direction (the gradient) along which the objective function drops and to make a small step in that direction to update θ and ξ. However, the cost function defined in Equation 19 requires summing over all the training instances to perform one update, which is too costly. So we update θ and ξ based on each training instance, which is called stochastic gradient descent. The procedure is shown in Algorithm 1.

Algorithm 1: Stochastic Gradient Descent
Input: m training instances; lr: learning rate
Output: optimal θ and ξ
1  t = 0;
2  initialize θ^0 and ξ^0;
3  while J(θ, ξ) has not converged do
4      Randomly shuffle the m training instances;
       foreach training instance k do
5          θ^{t+1} = θ^t - lr · ∂J_k(θ^t, ξ^t)/∂θ;
6          ξ^{t+1} = ξ^t - lr · ∂J_k(θ^t, ξ^t)/∂ξ;
7          t = t + 1;

where J_k(θ, ξ) is the cost on the k-th instance:

J_k(θ, ξ) = ( Σ_{i ∈ PT_k} Σ_{j ∈ NT_k} S(p_T(j) - p_T(i); β) ) / ( |PT_k| · |NT_k| )    (20)

The learning rate lr decides the step size in the descent direction. The random shuffle at Line 4 is required by stochastic gradient descent for convergence. The updating rules for θ and ξ are shown in Lines 5 and 6. We now discuss how to compute ∂J_k(θ, ξ)/∂θ and ∂J_k(θ, ξ)/∂ξ in detail:

∂J_k(θ, ξ)/∂θ = (1 / (|PT_k| · |NT_k|)) Σ_{i ∈ PT_k} Σ_{j ∈ NT_k} (∂S(δ_ji)/∂δ_ji) (∂p_T(j)/∂θ - ∂p_T(i)/∂θ)    (21)

∂J_k(θ, ξ)/∂ξ = (1 / (|PT_k| · |NT_k|)) Σ_{i ∈ PT_k} Σ_{j ∈ NT_k} (∂S(δ_ji)/∂δ_ji) (∂p_T(j)/∂ξ - ∂p_T(i)/∂ξ)    (22)

where δ_ji = p_T(j) - p_T(i). The factor ∂S(δ_ji)/∂δ_ji is easy to compute: from Equation 18 we can derive that ∂S(δ_ji)/∂δ_ji = β S(δ_ji)(1 - S(δ_ji)). The remaining question is how to compute ∂p_T(j)/∂θ and ∂p_T(j)/∂ξ, which is discussed in the next section.

4.3 Derivatives of the Random Walk

In this section, we discuss how to compute the derivatives of the random walk. With p = (p_U; p_T; p_I), we want to compute ∂p/∂θ and ∂p/∂ξ. The basic idea is that we can derive an iterative scheme for the derivatives that mirrors the definition of the random walk itself.

Derivatives with respect to θ. Since ∂p/∂θ is composed of ∂p/∂θ_MN (M, N ∈ {U, T, I}), without loss of generality we introduce how to compute ∂p/∂θ_UU. Taking the derivatives with respect to θ_UU on both sides of Equations 13, 14 and 15, we get

∂p_U/∂θ_UU = (1 - α) Σ_{N ∈ {U,T,I}} ( (∂Â_UN/∂θ_UU) p_N + Â_UN (∂p_N/∂θ_UU) )    (23)

∂p_T/∂θ_UU = (1 - α) Σ_{N ∈ {U,T,I}} ( (∂Â_TN/∂θ_UU) p_N + Â_TN (∂p_N/∂θ_UU) )    (24)

∂p_I/∂θ_UU = (1 - α) Σ_{N ∈ {U,T,I}} ( (∂Â_IN/∂θ_UU) p_N + Â_IN (∂p_N/∂θ_UU) )    (25)

Following the same rule, we can compute the derivatives with respect to any θ_MN (M, N ∈ {U, T, I}); they all have the same form as the above three equations. To make the connection between computing p and computing ∂p/∂θ_MN clearer, we rewrite the above three equations in matrix form, with θ_UU replaced by a generic θ_MN:

∂p/∂θ_MN = (1 - α) Â (∂p/∂θ_MN) + (1 - α) (∂Â/∂θ_MN) (p_U; p_T; p_I)    (26)

where Â is the transition matrix defined in the original random walk. Comparing this equation with Equation 1 for computing p, we find two differences: (1) p is replaced by ∂p/∂θ_MN. (2) The last term on the right side is different. However, only the first term, (1 - α) Â (∂p/∂θ_MN), decides whether Equation 26 converges to a stable state. More details about the convergence are given in the appendix.

The last detail is how to compute ∂Â/∂θ_MN. Without loss of generality, we discuss ∂Â/∂θ_UU. Recall that Â is composed of the sub-matrices {Â_MN | M, N ∈ {U, T, I}}, and not all Â_MN are related to θ_UU. According to Equation 7, only Â_UU, Â_TU and Â_IU can be influenced by θ_UU, so we only need to compute ∂Â_UU/∂θ_UU, ∂Â_TU/∂θ_UU and ∂Â_IU/∂θ_UU. Taking ∂Â_UU/∂θ_UU as an example, we get

∂Â_UU/∂θ_UU = (∂A_UU/∂θ_UU) D_U^{-1} + A_UU (∂D_U^{-1}/∂θ_UU)    (27)

Each entry of A_UU is defined according to Equation 5. For u_1, u_2 ∈ U, we have

∂A_UU(u_1, u_2)/∂θ_UU = f'_edge(θ_UUᵀ X_UU(u_1, u_2)) X_UU(u_1, u_2)    (28)

Each entry on the diagonal of D_U is the out-degree of a user.
According to Equation 8, for u ∈ U, the derivative is

∂D_U^{-1}(u, u)/∂θ_UU = - ( Σ_{M ∈ {U,T,I}} Σ_{k=1}^{|M|} ∂A_MU(k, u)/∂θ_UU ) / ( Σ_{M ∈ {U,T,I}} Σ_{k=1}^{|M|} A_MU(k, u) )²    (29)

So far we have explained how to compute ∂Â_UU/∂θ_UU; the same process can be used to compute ∂Â_TU/∂θ_UU and ∂Â_IU/∂θ_UU.
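Putting Algorithm 1 and Equation 26 together, one stochastic update can be sketched as follows (Python/NumPy; our own illustration, where grad_hat_A stands for ∂Â/∂θ_MN with respect to a single feature weight and is assumed to be precomputed via Equations 27-29):

    import numpy as np

    def dp_dtheta(hat_A, grad_hat_A, p, alpha, n_iter=100):
        # Equation 26, iterated to a fixed point:
        # dp = (1 - alpha) * (hat_A @ dp + grad_hat_A @ p)
        dp = np.zeros_like(p)
        for _ in range(n_iter):
            dp = (1 - alpha) * (hat_A @ dp + grad_hat_A @ p)
        return dp

    def sgd_step(theta, xi, grad_theta, grad_xi, lr):
        # Algorithm 1, lines 5-6: move against the per-instance gradients
        # assembled from Equations 21 and 22.
        return theta - lr * grad_theta, xi - lr * grad_xi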

Derivatives with respect to ξ. Computing ∂p/∂ξ is analogous to computing ∂p/∂θ. Since ∂p/∂ξ is composed of ∂p/∂ξ_M (M ∈ {U, I}), without loss of generality we first focus on how to compute ∂p/∂ξ_U. Taking the partial derivatives with respect to ξ_U on both sides of Equations 13, 14 and 15, we get

∂p_U/∂ξ_U = (1 - α) Σ_{N ∈ {U,T,I}} Â_UN (∂p_N/∂ξ_U) + α ∂q̂_U/∂ξ_U    (30)

∂p_T/∂ξ_U = (1 - α) Σ_{N ∈ {U,T,I}} Â_TN (∂p_N/∂ξ_U) + α ∂q̂_T/∂ξ_U    (31)

∂p_I/∂ξ_U = (1 - α) Σ_{N ∈ {U,T,I}} Â_IN (∂p_N/∂ξ_U) + α ∂q̂_I/∂ξ_U    (32)

Following the same rule, ∂p/∂ξ_I can also be obtained. Replacing ξ_U with a generic ξ_M (M ∈ {U, I}), we can rewrite the above three equations as a single equation in matrix form:

∂p/∂ξ_M = (1 - α) Â (∂p/∂ξ_M) + α (∂q̂_U/∂ξ_M; ∂q̂_T/∂ξ_M; ∂q̂_I/∂ξ_M)    (33)

From this equation, we can see that computing ∂p/∂ξ_M also has the same form as Equation 1; more details on the convergence are given in the appendix. The last detail is how to compute ∂q̂/∂ξ_M (M ∈ {U, I}). Without loss of generality, suppose M is U. According to Equation 11, we have

∂q̂/∂ξ_U = (1/D_q) (∂q_U/∂ξ_U; 0; 0) + (∂D_q^{-1}/∂ξ_U) (q_U; 0; q_I)    (34)

Each entry of q_U is defined according to Equation 9. For u ∈ U, we have

∂q_U(u)/∂ξ_U = f'_node(ξ_Uᵀ Y_U) Y_U    (35)

When f_node is the sigmoid function, df_node(x)/dx = f_node(x)(1 - f_node(x)). D_q is defined according to Equation 12, and its derivative is

∂D_q^{-1}/∂ξ_U = - ( Σ_{M ∈ {U,I}} Σ_{k=1}^{|M|} ∂q_M(k)/∂ξ_U ) / ( Σ_{M ∈ {U,I}} Σ_{k=1}^{|M|} q_M(k) )²    (36)

So far we have described how to compute ∂q̂/∂ξ_U; the same process can be performed to compute ∂q̂/∂ξ_I. To sum up, we have introduced how to compute ∂p/∂θ and ∂p/∂ξ, which is summarized in Algorithm 2.

Algorithm 2: Derivatives of the random walk
Input: transition matrix Â and preference vector q̂
Output: ∂p/∂θ and ∂p/∂ξ
1   t = 0;
2   Initialize p^0;
3   while p has not converged do
4       p^{t+1} = (1 - α) Â p^t + α q̂;
5       t = t + 1;
6   t = 0;
7   Initialize (∂p/∂θ)^0;
8   while ∂p/∂θ has not converged do
9       Compute (∂p/∂θ)^{t+1} according to Equation 26;
10      t = t + 1;
11  t = 0;
12  Initialize (∂p/∂ξ)^0;
13  while ∂p/∂ξ has not converged do
14      Compute (∂p/∂ξ)^{t+1} according to Equation 33;
15      t = t + 1;

5. EXPERIMENTAL STUDY

5.1 Datasets

We test OptRank on two publicly available datasets, Delicious and Last.fm, which were published by [2] as benchmarks. Delicious contains posts involving 1867 users and their tags and items, along with 5328 user relations, tag relations and 597 item relations; all the types of intra-relations we studied are included in Delicious. Posts are represented by <user, tag, item>. Last.fm contains 2464 posts involving 1892 users, 9749 tags and 2523 items, together with user relations. Last.fm is a smaller dataset, and only user relations are available. We introduce each type of relation and its features as follows.

Inter-relation. For an edge <m, n> (m ∈ M, n ∈ N; M, N ∈ {U, T, I}; M ≠ N), the feature vector is X_MN(m, n) = (the number of times m co-occurred with n in the posts). For example, for <u, t> (u ∈ U, t ∈ T), X_UT(u, t) = (the number of times u co-occurred with t in the posts), i.e., the number of times t has been used by u. In our experiments, we use the same feature set for <m, n> and <n, m>; this means that A_MN and A_NM are both decided by X_MN and θ_MN.

User Relation. User relations are formed by the social network. Each relation is bi-directional and binary weighted. To find the strength of a user relation, we check the items and tags the two users have in common. More formally, user u can be represented by an item vector A_IU(:, u) and a tag vector A_TU(:, u). Each entry of A_IU and A_TU is re-weighted by TF-IDF; users and items can be viewed as documents and words, as in information retrieval. Let Ã_MN (M, N ∈ {U, T, I}) denote A_MN re-weighted by TF-IDF. For an edge <u_1, u_2> (u_1, u_2 ∈ U), the feature vector is X_UU(u_1, u_2) = (cos(Ã_TU(:, u_1), Ã_TU(:, u_2)), cos(Ã_IU(:, u_1), Ã_IU(:, u_2))).

Tag Relation. Tag semantic relatedness is computed with the help of Wikipedia. To be more specific, 47% of the tags are article titles in Wikipedia. Articles link to each other by anchor texts, so the semantic relatedness of a tag pair can be inferred from the number of links between the corresponding article pair. We use WikipediaMiner [10], an off-the-shelf tool, to calculate semantic relatedness.
Only tag pairs that have semantic relatedness larger than 0.25 are retained. To refine the edge weights, tags are also represented by user vectors and item vectors: we apply the same TF-IDF weighting to A_UT and A_IT, and let Ã_UT and Ã_IT denote the weighted matrices. For an edge <t_1, t_2> (t_1, t_2 ∈ T), the feature vector is X_TT(t_1, t_2) = (semantic relatedness, cos(Ã_UT(:, t_1), Ã_UT(:, t_2)), cos(Ã_IT(:, t_1), Ã_IT(:, t_2))).
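The TF-IDF re-weighting and cosine features used for the user and tag relations above (and for the item relation below) can be sketched as follows (Python/NumPy; our own illustration, with A_TU and A_IU given as count matrices whose columns index users):

    import numpy as np

    def tfidf(A):
        # Re-weight a count matrix (rows play the role of words,
        # columns the role of documents).
        tf = A / np.maximum(A.sum(axis=0, keepdims=True), 1)
        df = (A > 0).sum(axis=1, keepdims=True)
        idf = np.log(A.shape[1] / np.maximum(df, 1))
        return tf * idf

    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom > 0 else 0.0

    def user_edge_features(A_TU, A_IU, u1, u2):
        # X_UU(u1, u2): cosine similarity of the users' tag vectors and of
        # their item vectors, both TF-IDF weighted.
        At, Ai = tfidf(A_TU), tfidf(A_IU)
        return np.array([cosine(At[:, u1], At[:, u2]),
                         cosine(Ai[:, u1], Ai[:, u2])])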

Item Relation. We calculate item similarities based on Web page titles in Delicious. A title is a vector of words with TF-IDF weighting on each entry. Besides content similarities, we refine the edge weights with the TF-IDF weighted Ã_UI and Ã_TI. For <i_1, i_2> (i_1, i_2 ∈ I), X_II(i_1, i_2) = (cos(title_1, title_2), cos(Ã_UI(:, i_1), Ã_UI(:, i_2)), cos(Ã_TI(:, i_1), Ã_TI(:, i_2))).

As in logistic regression, we add a constant feature to each feature set X_MN, and all the features are normalized to have mean 0 and standard deviation 1.

5.2 Baselines

Since OptRank is an extension of existing graph-based methods, we want to establish two points: (1) OptRank outperforms existing graph-based methods when only <user, tag, item> is available. (2) OptRank further improves the performance by incorporating social networks, tag semantic relatedness and item content similarities. We choose two graph-based methods as our baselines.

Random Walk with Restart. Random Walk with Restart, called RWR for short, is the unsupervised version of OptRank. RWR runs on the graph defined by <user, tag, item>. The weight of the edge <m, n> (m ∈ M, n ∈ N; M, N ∈ {U, T, I}; M ≠ N) is the number of times m co-occurred with n in the posts. Given a user u and an item i for tag recommendation, when the random walker decides to restart, it restarts at u with probability 0.5 and at i with probability 0.5. RWR has been adopted in [6] to incorporate social networks, but there the different types of edges are normalized empirically, in a way that is hard to reproduce.

FolkRank. FolkRank is a state-of-the-art graph-based algorithm. The graph is defined in the same way as for RWR. FolkRank can be summarized in three steps: (1) Calculate a global PageRank score p_global for each node. (2) Calculate a personalized PageRank score p_pref for each node, with special preference given to u and i. (3) Calculate the FolkRank score as the wins and losses between the personalized PageRank and the global PageRank, i.e., score = p_pref - p_global. In our experiments, we set the damping factor to 0.7, which achieves the best performance for FolkRank. In the following, FolkRank is denoted by FR.

We are aware that there are many methods based on tensor factorization [12, 13, 15]. However, tensor factorization needs to learn a low-rank approximation vector for each user, item and tag. In OptRank, a user need not even exist in the training set and can still receive recommendations if she/he has neighbors in the test set. Moreover, OptRank only needs about 3000 training instances to reach its best performance, while tensor factorization would fail with such a small training set, which would make a comparison unfair. For this reason, we did not choose these methods as baselines.

5.3 Evaluation Methodology

Performance Measurements. We use average precision, the precision-recall curve and the average AUC (Area Under the ROC Curve) to measure performance. We are aware that the optimum of the AUC is not necessarily the optimum of average precision/recall. To trade off between the best AUC and the best precision, we choose the model that has both high AUC and high precision on the cross validation set; the model is then evaluated on the test set.

Training/Cross Validation/Test Set. Posts are aggregated into records <u, i, PT> (u ∈ U, i ∈ I). For each dataset, we randomly picked 5000, 3000, and 3000 records as the training set, cross validation set, and test set, respectively.
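A sketch of the precision measurement (our own illustration; average precision over the test records is the mean of these values):

    def precision_at_k(ranked_tags, positive_tags, k):
        # Fraction of the top-k recommended tags the user actually applied.
        top_k = ranked_tags[:k]
        return sum(t in positive_tags for t in top_k) / k

    # Example: P@2 for a toy recommendation list.
    print(precision_at_k(["python", "web", "news"], {"python", "news"}, 2))  # 0.5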
5.4 Parameters

β and learning rate. β in Equation 18 controls the error of approximating I(x): the bigger β is, the smaller the approximation error. However, when β gets too big, the derivative around x = 0 gets too steep and causes numerical problems; when β gets too small, minimizing J(θ, ξ) fails to maximize the AUC. From Equations 21 and 22, we know that the summation of the derivatives is divided by |PT_k| · |NT_k|. Since a large dataset has a big |NT_k|, we can use a big β. In our experiments, β is 10^9 in Delicious and 10^6 in Last.fm. The learning rate lr is strongly related to β: when lr gets too big, stochastic gradient descent fails to converge. lr is set to 10 in both datasets.

Restart Probability α. α controls how frequently the random walker chooses to restart. We evaluate how the AUC and precision change by varying α from 0.2 to 0.8 on Delicious, with OptRank run on the inter-relation formed by <user, tag, item>. The results are shown in Figure 3.

Figure 3: The effect of α on precision and AUC.

When precision and AUC are both considered, α ∈ [0.6, 0.8] is a good choice, and we set α within this range.

5.5 Experimental Results

Results on Delicious. The results on Delicious are shown in Table 1 and Figure 4. FR denotes FolkRank; OptRank_Edge, OptRank_Node and OptRank_EN denote OptRank with only edge features enabled, with only node features enabled, and with both enabled, respectively.

First, we compare the algorithms that run only on the inter-relations formed by <user, tag, item>. Since RWR always performs better than FR, we only compare OptRank with RWR in the following. When only edge features are enabled, OptRank_Edge has performance comparable to RWR; this indicates that the original transition matrix of RWR and FR is nearly optimal. When only node features are enabled, OptRank_Node learns the best weights for u and i, which improves the top-1 precision by 3.3% over RWR; this indicates that the original node weights are not optimal. When edge features and node features are both enabled, OptRank_EN further improves the top-1 precision by 2.4% over OptRank_Node. From Figure 4 we can see that OptRank_EN outperforms RWR at top-5, but the advantage disappears at top-10. However, since a user usually annotates an item with fewer than 5 tags, top-5 performance is considered more important than top-10 performance.

Figure 4: Precision-recall curves on Delicious for FR, RWR, OptRank_EN, OptRank_I and OptRank_UTI.

Table 1: Precision (P@1, P@2, P@3) and AUC on Delicious for FR, RWR, OptRank_Edge, OptRank_Node, OptRank_EN, OptRank_U, OptRank_T, OptRank_I and OptRank_UTI.

In terms of AUC, FolkRank has relatively poor performance, worse than its precision. In contrast, RWR has a much better average AUC. This is probably because FolkRank is an empirically designed algorithm and relies too much on global information. We can also see that a high precision does not indicate a high AUC.

Now we discuss how OptRank performs when the extra user relations, tag relations and item relations are used; OptRank_U, OptRank_T and OptRank_I denote the models that add each relation separately, and OptRank_UTI combines all of them. When each type of relation is considered separately, OptRank_U, OptRank_T and OptRank_I improve the top-1 precision by around 1% over OptRank_EN, which is not very significant compared with the previous improvements. However, as we can see from Figure 4, the top-10 performance of OptRank_I is significantly improved compared with OptRank_EN. Since OptRank_U, OptRank_T and OptRank_I are comparable, only OptRank_I is shown in Figure 4. When all the relations are combined, OptRank_UTI achieves the best performance at every top-k; in terms of AUC, OptRank_UTI also achieves the best performance.

Results on Last.fm. The results are shown in Figure 5 and Table 2. They are significantly better than the results on Delicious, which can be explained in terms of data sparsity: when only inter-relations are considered, a post can be viewed as an entry in the three-dimensional array spanned by users, tags and items, and a larger fraction of these entries is known in Last.fm than in Delicious. Thus Last.fm is less sparse and more predictable than Delicious.

Figure 5: Precision-recall curves on Last.fm for FR, RWR, OptRank_EN and OptRank_U.

Table 2: Precision (P@1, P@2, P@3) and AUC on Last.fm for FR, RWR, OptRank_EN and OptRank_U.

In Last.fm, all algorithms have comparable performance at top-10, so we mainly focus on the top-5 performance in this experiment; this is reasonable since users usually annotate an item with at most 5 tags. From Figure 5 we can see that FolkRank and RWR have comparable precision, which differs from the results on Delicious. When node features and edge features are both considered, OptRank_EN improves P@1, P@2 and P@3 by 1.2%, 1.5% and 1.5%, respectively, compared with FR and RWR. Although Last.fm is less sparse than Delicious, when the social network is added, OptRank_U still improves the top-1 precision by 3.7% compared with the two baselines. In terms of AUC, the empirically designed FR again falls behind the other methods, while OptRank_U achieves the highest AUC.

To sum up, we draw two conclusions from the experiments: (1) When only <user, tag, item> is available, OptRank outperforms RWR and FolkRank. (2) OptRank successfully combines extra relations to improve the performance.

Now we discuss some details of the training process. Since Delicious is bigger and takes more time, the training size and running time are reported for Delicious.

Over-fitting Issues. Over-fitting does not seem to be a problem in our model, since we only have 18 parameters when all the relations are combined (each inter-relation has 2 parameters; the user relation has 3 parameters; the tag relation and the item relation have 4 parameters each). OptRank_UTI achieves nearly the same top-1 precision on the cross validation set and the test set.

Training Size. The required training size is really small compared with tensor factorization: OptRank_EN reaches its best performance after 1600 training instances, and OptRank_UTI after 1200 training instances.

Running Time.
The experiments were conducted on a single PC with a 2-core 3.2GHz CPU and 2G main memory. We implemented the algorithm in Matlab with full vectorization. When all the relations are combined, each training instance takes nearly 3.5 seconds, and most of that time is spent on computing the gradients; prediction takes around 0.1 seconds per instance. Training with 5000 instances would thus take 4.8 hours at most; however, all the algorithms in our experiments achieve their best performance within 2000 training instances.

6. RELATED WORK

There are mainly three approaches to personalized recommendation in social tagging systems: (1) The graph-based approach [4, 5, 17]. (2) Tensor decomposition [12, 13, 15]: the annotation relation is modeled as a cube with many

unknown entries; after performing tensor decomposition, we can predict the unknown entries by low-rank approximations. (3) User/item-based collaborative filtering [8, 11, 18]: the original user-item matrix is extended with tag information so that user/item-based collaborative filtering methods can be applied.

Besides annotation behaviors, the user space, tag space and item space have also been explored. [9] studied trust networks and proposed a factor analysis approach based on probabilistic matrix factorization. [6] incorporates the social network for item recommendation, but fails to improve the performance significantly. [14] links social tags from Flickr into WordNet. [7] introduces item taxonomies into recommender systems.

This paper is mainly inspired by two recent works on graph-based learning [1] and semi-supervised learning [3]. [1] proposes supervised random walks to learn the edge weights for link prediction in a homogeneous graph; this paper extends [1] with multi-type edges and nodes. [3] proposed a similar idea of learning edge weights and node weights in a homogeneous graph, within a transductive learning framework; since a recommender should have the ability to predict future events, our framework differs from [3] in that ours belongs to inductive learning.

7. CONCLUSION AND FUTURE WORK

In this paper, we propose an optimization-based graph method for personalized tag recommendation. To alleviate data sparsity, different sources of information are incorporated into the optimization framework. Some problems remain for future work: (1) Reducing the graph size: since the random walker frequently restarts at u and i, nodes that are far away from u and i may be pruned without influencing the final ranking. (2) Comparing with tensor factorization methods under a suitable experimental setting. (3) Exploring more features to further improve the results, such as temporal factors.

8. ACKNOWLEDGMENTS

This work was supported in part by the National Basic Research Program of China (973 Program) under Grant No. 2011CB302206, and the National Natural Science Foundation of China.

9. REFERENCES

[1] L. Backstrom and J. Leskovec. Supervised random walks: predicting and recommending links in social networks. In WSDM, 2011.
[2] I. Cantador, P. Brusilovsky, and T. Kuflik. Second workshop on information heterogeneity and fusion in recommender systems (HetRec 2011). In RecSys. ACM, 2011.
[3] B. Gao, T.-Y. Liu, W. Wei, T. Wang, and H. Li. Semi-supervised ranking on very large graphs with rich metadata. In KDD, pages 96-104, 2011.
[4] Z. Guan, J. Bu, Q. Mei, C. Chen, and C. Wang. Personalized tag recommendation using graph-based ranking on multi-type interrelated objects. In SIGIR, 2009.
[5] R. Jäschke, L. B. Marinho, A. Hotho, L. Schmidt-Thieme, and G. Stumme. Tag recommendations in folksonomies. In PKDD, 2007.
[6] I. Konstas, V. Stathopoulos, and J. M. Jose. On social networks and collaborative recommendation. In SIGIR, 2009.
[7] H. Liang, Y. Xu, Y. Li, and R. Nayak. Personalized recommender system based on item taxonomy and folksonomy. In CIKM, 2010.
[8] H. Liang, Y. Xu, Y. Li, R. Nayak, and X. Tao. Connecting users and items with weighted tags for personalized item recommendations. In HT. ACM, 2010.
[9] H. Ma, T. C. Zhou, M. R. Lyu, and I. King. Improving recommender systems by incorporating social contextual information. ACM Trans. Inf. Syst., 29(2):9:1-9:23, Apr. 2011.
[10] D. Milne. An open-source toolkit for mining Wikipedia.
[11] J. Peng, D. D. Zeng, H. Zhao, and F.-Y. Wang.
Collaborative filtering in social tagging systems based on joint item-tag recommendations. In CIKM. ACM, 2010.
[12] S. Rendle, L. B. Marinho, A. Nanopoulos, and L. Schmidt-Thieme. Learning optimal ranking with tensor factorization for tag recommendation. In KDD, 2009.
[13] S. Rendle and L. Schmidt-Thieme. Pairwise interaction tensor factorization for personalized tag recommendation. In WSDM, pages 81-90, 2010.
[14] B. Sigurbjörnsson and R. van Zwol. Flickr tag recommendation based on collective knowledge. In WWW. ACM, 2008.
[15] P. Symeonidis, A. Nanopoulos, and Y. Manolopoulos. Tag recommendations based on tensor dimensionality reduction. In RecSys, pages 43-50, 2008.
[16] P. Symeonidis, A. Nanopoulos, and Y. Manolopoulos. A unified framework for providing recommendations in social tagging systems based on ternary semantic analysis. TKDE, 22(2):179-192, 2010.
[17] H. Yildirim and M. S. Krishnamoorthy. A random walk method for alleviating the sparsity problem in collaborative filtering. In RecSys, pages 131-138, 2008.
[18] Y. Zhen, W.-J. Li, and D.-Y. Yeung. TagiCoFi: tag informed collaborative filtering. In RecSys. ACM, 2009.

APPENDIX

We prove the convergence of Equations 26 and 33. Both equations can be rewritten in a more general form:

p^{t+1} = λ Â p^{t} + μ q

where 0 < λ < 1 and 0 < μ ≤ 1, Â is a transition matrix with each column summing to 1, and q can be any vector with the same dimension as p. Suppose p^{0} = π. Then p^{1} = λ Â π + μ q, p^{2} = λ² Â² π + λ Â μ q + μ q, ..., p^{n} = λⁿ Âⁿ π + Σ_{k=0}^{n-1} λᵏ Âᵏ μ q. Since 0 < λ < 1 and the eigenvalues of the transition matrix Â lie in [-1, 1], we have lim_{n→∞} λⁿ Âⁿ = 0 and lim_{n→∞} Σ_{k=0}^{n-1} λᵏ Âᵏ = (I - λ Â)^{-1}. So p^{n} finally converges to p = (I - λ Â)^{-1} μ q.
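The closed form can be checked numerically (a sketch under a random toy setup of our own; any column-stochastic Â works):

    import numpy as np

    rng = np.random.default_rng(0)
    n, lam, mu = 5, 0.8, 0.2
    A = rng.random((n, n))
    A /= A.sum(axis=0)                    # column-stochastic transition matrix
    q = rng.random(n)

    p = rng.random(n)                     # arbitrary start pi
    for _ in range(500):
        p = lam * A @ p + mu * q          # the iteration above

    closed = mu * np.linalg.solve(np.eye(n) - lam * A, q)
    print(np.allclose(p, closed))         # -> True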


More information

Learning Query and Document Similarities from Click-through Bipartite Graph with Metadata

Learning Query and Document Similarities from Click-through Bipartite Graph with Metadata Learning Query and Document Similarities from Click-through Bipartite Graph with Metadata Wei Wu a, Hang Li b, Jun Xu b a Department of Probability and Statistics, Peking University b Microsoft Research

More information

Web Structure Mining Nodes, Links and Influence

Web Structure Mining Nodes, Links and Influence Web Structure Mining Nodes, Links and Influence 1 Outline 1. Importance of nodes 1. Centrality 2. Prestige 3. Page Rank 4. Hubs and Authority 5. Metrics comparison 2. Link analysis 3. Influence model 1.

More information

Learning Topical Transition Probabilities in Click Through Data with Regression Models

Learning Topical Transition Probabilities in Click Through Data with Regression Models Learning Topical Transition Probabilities in Click Through Data with Regression Xiao Zhang, Prasenjit Mitra Department of Computer Science and Engineering College of Information Sciences and Technology

More information

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation. ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent

More information

ANLP Lecture 22 Lexical Semantics with Dense Vectors

ANLP Lecture 22 Lexical Semantics with Dense Vectors ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Henry S. Thompson ANLP Lecture 22 5 November 2018 Previous

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Leverage Sparse Information in Predictive Modeling

Leverage Sparse Information in Predictive Modeling Leverage Sparse Information in Predictive Modeling Liang Xie Countrywide Home Loans, Countrywide Bank, FSB August 29, 2008 Abstract This paper examines an innovative method to leverage information from

More information

Computer science research seminar: VideoLectures.Net recommender system challenge: presentation of baseline solution

Computer science research seminar: VideoLectures.Net recommender system challenge: presentation of baseline solution Computer science research seminar: VideoLectures.Net recommender system challenge: presentation of baseline solution Nino Antulov-Fantulin 1, Mentors: Tomislav Šmuc 1 and Mile Šikić 2 3 1 Institute Rudjer

More information

Measuring Similarity in Large-scale Folksonomies

Measuring Similarity in Large-scale Folksonomies Measuring Similarity in Large-scale Folksonomies Giovanni Quattrone, Emilio Ferrara, Pasquale De Meo, Licia Capra Abstract Social (or folksonomic) tagging has become a very popular way to describe content

More information

Location Regularization-Based POI Recommendation in Location-Based Social Networks

Location Regularization-Based POI Recommendation in Location-Based Social Networks information Article Location Regularization-Based POI Recommendation in Location-Based Social Networks Lei Guo 1,2, * ID, Haoran Jiang 3 and Xinhua Wang 4 1 Postdoctoral Research Station of Management

More information

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci Link Analysis Information Retrieval and Data Mining Prof. Matteo Matteucci Hyperlinks for Indexing and Ranking 2 Page A Hyperlink Page B Intuitions The anchor text might describe the target page B Anchor

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!

More information

Multi-Relational Matrix Factorization using Bayesian Personalized Ranking for Social Network Data

Multi-Relational Matrix Factorization using Bayesian Personalized Ranking for Social Network Data Multi-Relational Matrix Factorization using Bayesian Personalized Ranking for Social Network Data Artus Krohn-Grimberghe, Lucas Drumond, Christoph Freudenthaler, and Lars Schmidt-Thieme Information Systems

More information

Analysis of the recommendation systems based on the tensor factorization techniques, experiments and the proposals

Analysis of the recommendation systems based on the tensor factorization techniques, experiments and the proposals Aalborg University Project Report Analysis of the recommendation systems based on the tensor factorization techniques, experiments and the proposals Group: d519a Authors: Martin Leginus Valdas Žemaitis

More information

A Study of the Dirichlet Priors for Term Frequency Normalisation

A Study of the Dirichlet Priors for Term Frequency Normalisation A Study of the Dirichlet Priors for Term Frequency Normalisation ABSTRACT Ben He Department of Computing Science University of Glasgow Glasgow, United Kingdom ben@dcs.gla.ac.uk In Information Retrieval

More information

On Top-k Structural. Similarity Search. Pei Lee, Laks V.S. Lakshmanan University of British Columbia Vancouver, BC, Canada

On Top-k Structural. Similarity Search. Pei Lee, Laks V.S. Lakshmanan University of British Columbia Vancouver, BC, Canada On Top-k Structural 1 Similarity Search Pei Lee, Laks V.S. Lakshmanan University of British Columbia Vancouver, BC, Canada Jeffrey Xu Yu Chinese University of Hong Kong Hong Kong, China 2014/10/14 Pei

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Comparative Document Analysis for Large Text Corpora

Comparative Document Analysis for Large Text Corpora Comparative Document Analysis for Large Text Corpora Xiang Ren Yuanhua Lv Kuansan Wang Jiawei Han University of Illinois at Urbana-Champaign, Urbana, IL, USA Microsoft Research, Redmond, WA, USA {xren7,

More information

Variations of Logistic Regression with Stochastic Gradient Descent

Variations of Logistic Regression with Stochastic Gradient Descent Variations of Logistic Regression with Stochastic Gradient Descent Panqu Wang(pawang@ucsd.edu) Phuc Xuan Nguyen(pxn002@ucsd.edu) January 26, 2012 Abstract In this paper, we extend the traditional logistic

More information

Star-Structured High-Order Heterogeneous Data Co-clustering based on Consistent Information Theory

Star-Structured High-Order Heterogeneous Data Co-clustering based on Consistent Information Theory Star-Structured High-Order Heterogeneous Data Co-clustering based on Consistent Information Theory Bin Gao Tie-an Liu Wei-ing Ma Microsoft Research Asia 4F Sigma Center No. 49 hichun Road Beijing 00080

More information

Relational Stacked Denoising Autoencoder for Tag Recommendation. Hao Wang

Relational Stacked Denoising Autoencoder for Tag Recommendation. Hao Wang Relational Stacked Denoising Autoencoder for Tag Recommendation Hao Wang Dept. of Computer Science and Engineering Hong Kong University of Science and Technology Joint work with Xingjian Shi and Dit-Yan

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

A Comparative Study of Matrix Factorization and Random Walk with Restart in Recommender Systems

A Comparative Study of Matrix Factorization and Random Walk with Restart in Recommender Systems A Comparative Study of Matrix Factorization and Random Walk with Restart in Recommender Systems Haekyu Park Computer Science and Engineering Seoul National University Seoul, Republic of Korea Email: hkpark627@snu.ac.kr

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 12 Jan-Willem van de Meent (credit: Yijun Zhao, Percy Liang) DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Linear Dimensionality

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize/navigate it? First try: Human curated Web directories Yahoo, DMOZ, LookSmart

More information

Semantics with Dense Vectors. Reference: D. Jurafsky and J. Martin, Speech and Language Processing

Semantics with Dense Vectors. Reference: D. Jurafsky and J. Martin, Speech and Language Processing Semantics with Dense Vectors Reference: D. Jurafsky and J. Martin, Speech and Language Processing 1 Semantics with Dense Vectors We saw how to represent a word as a sparse vector with dimensions corresponding

More information

Notes on Markov Networks

Notes on Markov Networks Notes on Markov Networks Lili Mou moull12@sei.pku.edu.cn December, 2014 This note covers basic topics in Markov networks. We mainly talk about the formal definition, Gibbs sampling for inference, and maximum

More information

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze Link Analysis Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 The Web as a Directed Graph Page A Anchor hyperlink Page B Assumption 1: A hyperlink between pages

More information

Restricted Boltzmann Machines for Collaborative Filtering

Restricted Boltzmann Machines for Collaborative Filtering Restricted Boltzmann Machines for Collaborative Filtering Authors: Ruslan Salakhutdinov Andriy Mnih Geoffrey Hinton Benjamin Schwehn Presentation by: Ioan Stanculescu 1 Overview The Netflix prize problem

More information

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting Outline for today Information Retrieval Efficient Scoring and Ranking Recap on ranked retrieval Jörg Tiedemann jorg.tiedemann@lingfil.uu.se Department of Linguistics and Philology Uppsala University Efficient

More information

Matrix Factorization Techniques for Recommender Systems

Matrix Factorization Techniques for Recommender Systems Matrix Factorization Techniques for Recommender Systems By Yehuda Koren Robert Bell Chris Volinsky Presented by Peng Xu Supervised by Prof. Michel Desmarais 1 Contents 1. Introduction 4. A Basic Matrix

More information

A Gradient-based Adaptive Learning Framework for Efficient Personal Recommendation

A Gradient-based Adaptive Learning Framework for Efficient Personal Recommendation A Gradient-based Adaptive Learning Framework for Efficient Personal Recommendation Yue Ning 1 Yue Shi 2 Liangjie Hong 2 Huzefa Rangwala 3 Naren Ramakrishnan 1 1 Virginia Tech 2 Yahoo Research. Yue Shi

More information

Collaborative Filtering with Aspect-based Opinion Mining: A Tensor Factorization Approach

Collaborative Filtering with Aspect-based Opinion Mining: A Tensor Factorization Approach 2012 IEEE 12th International Conference on Data Mining Collaborative Filtering with Aspect-based Opinion Mining: A Tensor Factorization Approach Yuanhong Wang,Yang Liu, Xiaohui Yu School of Computer Science

More information

Learning Query and Document Similarities from Click-through Bipartite Graph with Metadata

Learning Query and Document Similarities from Click-through Bipartite Graph with Metadata Learning Query and Document Similarities from Click-through Bipartite Graph with Metadata ABSTRACT Wei Wu Microsoft Research Asia No 5, Danling Street, Haidian District Beiing, China, 100080 wuwei@microsoft.com

More information

Trinity: Walking on a User-Object-Tag Heterogeneous Network for Personalised Recommendations

Trinity: Walking on a User-Object-Tag Heterogeneous Network for Personalised Recommendations Gan MX, Sun L, Jiang R. : Walking on a user-object-tag heterogeneous network for personalised recommendations. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 31(3): 577 594 May 2016. DOI 10.1007/s11390-016-1648-0

More information

Contextual Bandits in A Collaborative Environment

Contextual Bandits in A Collaborative Environment Contextual Bandits in A Collaborative Environment Qingyun Wu 1, Huazheng Wang 1, Quanquan Gu 2, Hongning Wang 1 1 Department of Computer Science 2 Department of Systems and Information Engineering University

More information

A Matrix Factorization Technique with Trust Propagation for Recommendation in Social Networks

A Matrix Factorization Technique with Trust Propagation for Recommendation in Social Networks A Matrix Factorization Technique with Trust Propagation for Recommendation in Social Networks ABSTRACT Mohsen Jamali School of Computing Science Simon Fraser University Burnaby, BC, Canada mohsen_jamali@cs.sfu.ca

More information

CSC 411: Lecture 04: Logistic Regression

CSC 411: Lecture 04: Logistic Regression CSC 411: Lecture 04: Logistic Regression Raquel Urtasun & Rich Zemel University of Toronto Sep 23, 2015 Urtasun & Zemel (UofT) CSC 411: 04-Prob Classif Sep 23, 2015 1 / 16 Today Key Concepts: Logistic

More information

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent Unsupervised Machine Learning and Data Mining DS 5230 / DS 4420 - Fall 2018 Lecture 7 Jan-Willem van de Meent DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Dimensionality Reduction Goal:

More information

A Randomized Approach for Crowdsourcing in the Presence of Multiple Views

A Randomized Approach for Crowdsourcing in the Presence of Multiple Views A Randomized Approach for Crowdsourcing in the Presence of Multiple Views Presenter: Yao Zhou joint work with: Jingrui He - 1 - Roadmap Motivation Proposed framework: M2VW Experimental results Conclusion

More information

Introduction to Logistic Regression

Introduction to Logistic Regression Introduction to Logistic Regression Guy Lebanon Binary Classification Binary classification is the most basic task in machine learning, and yet the most frequent. Binary classifiers often serve as the

More information

Test Generation for Designs with Multiple Clocks

Test Generation for Designs with Multiple Clocks 39.1 Test Generation for Designs with Multiple Clocks Xijiang Lin and Rob Thompson Mentor Graphics Corp. 8005 SW Boeckman Rd. Wilsonville, OR 97070 Abstract To improve the system performance, designs with

More information

Matrix Factorization with Content Relationships for Media Personalization

Matrix Factorization with Content Relationships for Media Personalization Association for Information Systems AIS Electronic Library (AISeL) Wirtschaftsinformatik Proceedings 013 Wirtschaftsinformatik 013 Matrix Factorization with Content Relationships for Media Personalization

More information

Recurrent Latent Variable Networks for Session-Based Recommendation

Recurrent Latent Variable Networks for Session-Based Recommendation Recurrent Latent Variable Networks for Session-Based Recommendation Panayiotis Christodoulou Cyprus University of Technology paa.christodoulou@edu.cut.ac.cy 27/8/2017 Panayiotis Christodoulou (C.U.T.)

More information

Data Mining Recitation Notes Week 3

Data Mining Recitation Notes Week 3 Data Mining Recitation Notes Week 3 Jack Rae January 28, 2013 1 Information Retrieval Given a set of documents, pull the (k) most similar document(s) to a given query. 1.1 Setup Say we have D documents

More information

Node similarity and classification

Node similarity and classification Node similarity and classification Davide Mottin, Anton Tsitsulin HassoPlattner Institute Graph Mining course Winter Semester 2017 Acknowledgements Some part of this lecture is taken from: http://web.eecs.umich.edu/~dkoutra/tut/icdm14.html

More information

Context-aware Ensemble of Multifaceted Factorization Models for Recommendation Prediction in Social Networks

Context-aware Ensemble of Multifaceted Factorization Models for Recommendation Prediction in Social Networks Context-aware Ensemble of Multifaceted Factorization Models for Recommendation Prediction in Social Networks Yunwen Chen kddchen@gmail.com Yingwei Xin xinyingwei@gmail.com Lu Yao luyao.2013@gmail.com Zuotao

More information

Discovering Geographical Topics in Twitter

Discovering Geographical Topics in Twitter Discovering Geographical Topics in Twitter Liangjie Hong, Lehigh University Amr Ahmed, Yahoo! Research Alexander J. Smola, Yahoo! Research Siva Gurumurthy, Twitter Kostas Tsioutsiouliklis, Twitter Overview

More information

Recommender Systems EE448, Big Data Mining, Lecture 10. Weinan Zhang Shanghai Jiao Tong University

Recommender Systems EE448, Big Data Mining, Lecture 10. Weinan Zhang Shanghai Jiao Tong University 2018 EE448, Big Data Mining, Lecture 10 Recommender Systems Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html Content of This Course Overview of

More information

Service Selection based on Similarity Measurement for Conditional Qualitative Preference

Service Selection based on Similarity Measurement for Conditional Qualitative Preference Service Selection based on Similarity Measurement for Conditional Qualitative Preference Hongbing Wang, Jie Zhang, Hualan Wang, Yangyu Tang, and Guibing Guo School of Computer Science and Engineering,

More information

ECE521 Lecture7. Logistic Regression

ECE521 Lecture7. Logistic Regression ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard

More information