Impact of Data Characteristics on Recommender Systems Performance

Similar documents
Collaborative Filtering. Radek Pelánek

Collaborative Filtering on Ordinal User Feedback

Predicting the Performance of Collaborative Filtering Algorithms

Collaborative topic models: motivations cont

Matrix Factorization Techniques For Recommender Systems. Collaborative Filtering

* Matrix Factorization and Recommendation Systems

Collaborative Filtering via Ensembles of Matrix Factorizations

Recommender Systems. Dipanjan Das Language Technologies Institute Carnegie Mellon University. 20 November, 2007

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task

Modeling User Rating Profiles For Collaborative Filtering

Preliminaries. Data Mining. The art of extracting knowledge from large bodies of structured data. Let's put it to use!

Matrix Factorization Techniques for Recommender Systems

Probabilistic Partial User Model Similarity for Collaborative Filtering

A Modified PMF Model Incorporating Implicit Item Associations

Probabilistic Neighborhood Selection in Collaborative Filtering Systems

Collaborative Filtering Using Orthogonal Nonnegative Matrix Tri-factorization

CS 175: Project in Artificial Intelligence. Slides 4: Collaborative Filtering

Collaborative Filtering with Aspect-based Opinion Mining: A Tensor Factorization Approach

Integrated Electricity Demand and Price Forecasting

Andriy Mnih and Ruslan Salakhutdinov

Algorithms for Collaborative Filtering

Matrix Factorization with Content Relationships for Media Personalization

Collaborative filtering based on multi-channel diffusion

DRIVING ROI. The Business Case for Advanced Weather Solutions for the Energy Market

Decoupled Collaborative Ranking

Collaborative Filtering with Temporal Dynamics with Using Singular Value Decomposition

NCDREC: A Decomposability Inspired Framework for Top-N Recommendation

Matrix Factorization In Recommender Systems. Yong Zheng, PhDc Center for Web Intelligence, DePaul University, USA March 4, 2015

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Predicting Neighbor Goodness in Collaborative Filtering

A METHOD OF FINDING IMAGE SIMILAR PATCHES BASED ON GRADIENT-COVARIANCE SIMILARITY

Forecasting Using Time Series Models

Large-scale Ordinal Collaborative Filtering

a Short Introduction

Recommender Systems EE448, Big Data Mining, Lecture 10. Weinan Zhang Shanghai Jiao Tong University

1 Bewley Economies with Aggregate Uncertainty

Recommender System for Yelp Dataset CS6220 Data Mining Northeastern University

Collaborative Recommendation with Multiclass Preference Context

6.034 Introduction to Artificial Intelligence

Recommendation Systems

NetBox: A Probabilistic Method for Analyzing Market Basket Data

Generative Models for Discrete Data

A Gradient-based Adaptive Learning Framework for Efficient Personal Recommendation

Warwick Business School Forecasting System. Summary. Ana Galvao, Anthony Garratt and James Mitchell November, 2014

Collaborative Filtering Applied to Educational Data Mining

CMAP: Effective Fusion of Quality and Relevance for Multi-criteria Recommendation

Using the Budget Features in Quicken 2008

Recommender Systems: Overview and. Package rectools. Norm Matloff. Dept. of Computer Science. University of California at Davis.

Recommendation Systems

Mining Positive and Negative Fuzzy Association Rules

arxiv: v2 [cs.ir] 14 May 2018

Masters of Marketing Spotlight Series

arxiv: v1 [cs.ir] 16 Oct 2013

CS425: Algorithms for Web Scale Data

Quantifying Weather Risk Analysis

Forecasting. Dr. Richard Jerz rjerz.com

SCMF: Sparse Covariance Matrix Factorization for Collaborative Filtering

Kernelized Matrix Factorization for Collaborative Filtering

Exploring the Ratings Prediction Task in a Group Recommender System that Automatically Detects Groups

From Non-Negative Matrix Factorization to Deep Learning

Competitive Equilibrium

Analysis of Bank Branches in the Greater Los Angeles Region

CS425: Algorithms for Web Scale Data

Click Prediction and Preference Ranking of RSS Feeds

Linear Regression Models

Finding Robust Solutions to Dynamic Optimization Problems

COT: Contextual Operating Tensor for Context-Aware Recommender Systems

MATH 1150 Chapter 2 Notation and Terminology

The Journal of Database Marketing, Vol. 6, No. 3, 1999, pp Retail Trade Area Analysis: Concepts and New Approaches

Maximum Margin Matrix Factorization for Collaborative Ranking

Data Science Mastery Program

Recommendation Systems

Application of Indirect Race/ Ethnicity Data in Quality Metric Analyses

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018

BiCycle: Item Recommendation with Life Cycles

Matrix Factorization and Collaborative Filtering

Collaborative Topic Modeling for Recommending Scientific Articles

Wisdom of the Better Few: Cold Start Recommendation via Representative based Rating Elicitation

Iterative Laplacian Score for Feature Selection

Elver: Recommending Facebook Pages in Cold Start Situation Without Content Features

Summary statistics. G.S. Questa, L. Trapani. MSc Induction - Summary statistics 1

SQL-Rank: A Listwise Approach to Collaborative Ranking

How likely is Simpson's paradox in path models?

Data Mining and Matrices

Analytics for an Online Retailer: Demand Forecasting and Price Optimization

Factored Proximity Models for Top-N Recommendations

FUZZY ASSOCIATION RULES: A TWO-SIDED APPROACH

A Bayesian Treatment of Social Links in Recommender Systems ; CU-CS

Algorithm Independent Topics Lecture 6

CIVL 7012/8012. Collection and Analysis of Information

Detecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules. M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D.

Large-scale Collaborative Ranking in Near-Linear Time

Data Mining Techniques

Joint user knowledge and matrix factorization for recommender systems

INFO 4300 / CS4300 Information Retrieval. Slides adapted from Hinrich Schütze's, linked from

Operations and Supply Chain Management Prof. G. Srinivasan Department of Management Studies Indian Institute of Technology, Madras

Multiscale Materials Design Using Informatics. S. R. Kalidindi, A. Agrawal, A. Choudhary, V. Sundararaghavan AFOSR-FA

Lecture 3. The Population Variance. The population variance, denoted σ², is the sum of the squared deviations about the population

Mixed Membership Matrix Factorization

Chapter 3 Multiple Regression Complete Example


Gediminas Adomavicius, YoungOk Kwon, Jingjing Zhang
Department of Information and Decision Sciences
Carlson School of Management, University of Minnesota
{gedas, kwonx052, zhang818}@umn.edu

Abstract. This paper investigates the impact of rating data characteristics on the performance of recommendation algorithms. We focus on three groups of data characteristics: rating density, rating frequency distribution, and rating value distribution. We introduce a window sampling procedure that can effectively manipulate the characteristics of rating samples, and we apply a regression model to uncover the relationships between data characteristics and recommendation accuracy. Our experimental results show that recommendation accuracy is strongly influenced by the structural characteristics of rating data and that the effects of these characteristics are consistent across different recommendation techniques. Understanding how data characteristics affect recommendation performance has practical significance: it enables recommender system designers to estimate the expected performance of their system in advance and, thus, to direct data collection efforts so as to maximize recommendation performance.

1. Introduction and Motivation

Recommender systems analyze patterns of user preferences and attempt to recommend items that are likely to interest users. In many applications, recommender systems use the notion of ratings to model user preferences for items; the ratings in these systems serve both as inputs (i.e., known ratings that users provided in the past) and as outputs (i.e., rating predictions for items that users have not yet consumed).
Much research in the recommender systems literature has focused on improving recommendation performance (usually measured by the accuracy of rating predictions), typically by proposing novel and increasingly sophisticated recommendation algorithms. It is well recognized that rating data can often be highly sparse and skewed and that this can have a significant impact on recommender systems performance [1,7]; however, little research has been dedicated to a systematic, in-depth exploration and analysis of this issue. Understanding it has great practical significance as well: for example, it would enable recommender system designers to estimate the expected performance of their system in advance (i.e., based on simple data characteristics) and, thus, to direct data collection efforts toward maximizing the performance of a recommendation algorithm. This study uses regression-based explanatory modeling to investigate which rating data characteristics are associated with variations in the recommendation performance of several popular recommendation algorithms. In this paper, we focus on three categories of rating data characteristics: rating density, rating distribution, and rating values, as described below.

2. Measures for Rating Data: Proposed Approach and Related Work

Overall Rating Density. Data sparsity is a problem common to most recommender systems, since users typically rate only a small proportion of the available items [1]. It can be aggravated by the fact that users or items newly added to the system may have no ratings at all, known as the cold-start problem. If the rating data is sparse, two users are unlikely to have many rated items in common, causing some algorithms (e.g., neighborhood-based CF techniques) to perform poorly in terms of recommendation accuracy, and several studies have been dedicated to alleviating the sparsity problem [2,3,9].
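As a toy illustration of the density measure (the proportion of known ratings among all possible user-item pairs), here is a minimal sketch, not the paper's code, with NaN marking unknown ratings:

```python
import numpy as np

# Toy 2-user x 3-item rating matrix; NaN marks an unknown rating.
R = np.array([[5.0, np.nan, 3.0],
              [np.nan, np.nan, 4.0]])

# Overall rating density: known ratings / all possible ratings.
density = np.mean(~np.isnan(R))
print(density)  # 3 known entries out of 6 possible -> 0.5
```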
In this paper, we measure overall rating density in the traditional way, i.e., as the proportion of known ratings (those actually provided by users) among all ratings that could possibly be given. The operating assumption is that sparser rating data leads to lower accuracy, and we explore this relationship under different density configurations.

Rating Frequency Distribution. Recommender systems data (e.g., in retail shopping applications) is typically not only sparse but also skewed, often exhibiting a long-tail distribution, where a few ("popular") items are bought very frequently but most items are bought only a few times. In other words, the popularity of an item is typically represented by the number of users who bought or rated it. For example, up to 80 percent of all Netflix movie rentals are long-tail movies [4], and Netflix makes more money with the long tail (i.e., the back-catalog titles and indie movies that cannot easily be found at a local Blockbuster or Best Buy) than with current releases [12]. However, long-tail items often have limited historical data and, if included in recommendation models, can lead to a decrease in the performance of recommender systems.

We explored several measures for characterizing the structural distribution of rating data, including the basic shape of the frequency distribution of user or item ratings (using the first four moments: mean, variance, skewness, kurtosis) as well as the concentration of items or users in the frequency distribution (using standard measures such as the Gini coefficient and the Herfindahl index). Many of these metrics were highly correlated for the rating datasets we used; after preliminary analysis (e.g., elimination of redundant/correlated metrics), skewness and the Gini coefficient were chosen as the best representative metrics for the different aspects of the rating distribution. In particular, skewness (the standardized third central moment in mathematical statistics) measures the asymmetry of an item (or user) frequency (popularity) distribution and is defined as:

  Skewness = E[((X − μ)/σ)³] = [(1/n) Σ_{i=1}^{n} (x_i − μ)³] / [(1/n) Σ_{i=1}^{n} (x_i − μ)²]^{3/2},

where μ is the mean and σ is the standard deviation of item popularity, and x_i represents the popularity of item i, defined as the number of users who rated it. This metric indicates whether the mass of the distribution is concentrated to the right or to the left of the mean, i.e., negative or positive skewness, as shown in Fig. 1a.
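The skewness computation above can be sketched directly from its moment definition (a minimal version using the biased 1/n moments, as in the formula; not the paper's code):

```python
import numpy as np

def popularity_skewness(popularity):
    """Skewness of an item (or user) popularity distribution, using the
    moment definition E[((X - mu)/sigma)^3] with 1/n moments."""
    x = np.asarray(popularity, dtype=float)
    mu = x.mean()
    m2 = np.mean((x - mu) ** 2)   # (biased) second central moment
    m3 = np.mean((x - mu) ** 3)   # third central moment
    return m3 / m2 ** 1.5

# A long-tail-like popularity vector (a few heavily rated items,
# many rarely rated ones) yields positive skewness.
print(popularity_skewness([100, 40, 10, 5, 4, 3, 2, 2, 1, 1]))
```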
In contrast, the Gini coefficient measures the concentration (or inequality) of an item (or user) frequency distribution [6]. It is a commonly used metric of wealth inequality in economics and can be computed as the ratio of the area between the line of equality and the Lorenz curve (shown in Fig. 1b), which plots the cumulative frequency of users or items arranged in ascending order of popularity (i.e., area A), over the total area under the line of equality (i.e., area A+B). Formally, for discrete distributions with items sorted in ascending order of popularity, it can be defined as follows:

  Gini = A/(A+B) = 1 − B/(A+B) = (2 Σ_{i=1}^{n} i·x_i)/(n·total) − (n+1)/n,

where x_i is the popularity of the i-th least popular item, n is the total number of items available in the dataset, and total is the total number of ratings. Thus, a value of 0 represents total equality (all items are equally popular), and a value approaching 1 represents maximal inequality (a few popular items receive all the ratings). These two metrics quantify different properties of an item (or user) frequency distribution. For example, our experiments show that skewness is sensitive to the number of items that are below/above the average item popularity, whereas the Gini coefficient is sensitive to the actual item popularity counts (e.g., the frequency of the most popular vs. the least popular item). Fig. 1c shows an example of two distributions with exactly the same Gini coefficient (0.23), where one distribution has positive (1.72) and the other negative (-0.75) skewness.

Rating Value. After exploring several basic statistics of the rating value distribution, we found rating variance to be the most informative rating-value-related measure when analyzing the impact of data characteristics on recommender systems performance. Rating variance has been explored in the recommender systems literature before.
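Likewise, the discrete Gini formula above can be sketched as follows (items sorted in ascending order of popularity; note that for a finite n the maximum value is (n−1)/n, which approaches 1 as n grows):

```python
import numpy as np

def popularity_gini(popularity):
    """Gini coefficient of an item (or user) popularity distribution,
    via the discrete formula with items sorted in ascending order."""
    x = np.sort(np.asarray(popularity, dtype=float))
    n = x.size
    total = x.sum()
    ranks = np.arange(1, n + 1)
    return (2.0 * np.sum(ranks * x)) / (n * total) - (n + 1.0) / n

print(popularity_gini([5, 5, 5, 5]))    # all items equally popular -> 0.0
print(popularity_gini([0, 0, 0, 12]))   # one item has every rating -> 0.75
```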
For example, the uncertainty of an item's rating is often measured by its variance, and a highly controversial item tends to have higher rating variance in the user population. High-variance data is not necessarily considered bad data, but it can be a cause of recommendation errors [7]. There has also been some positive evidence for high rating variance: for example, when faced with two movies with pre-calculated ratings, consumers have been found to prefer the high-variance movie [11]; also, in some settings high rating variance can positively affect consumers' purchase decisions and increase subsequent demand and profit [13].

Figure 1. Rating distribution metrics: skewness and Gini coefficient. (a) Skewness; (b) Gini coefficient; (c) Example: different skewness with the same Gini coefficient.

Analyzing the Impact of Data Characteristics on Recommendation Performance. While prior literature provides some discussion of how individual data characteristics may affect recommendation performance, in this paper we propose a more systematic way to explore the underlying relationship between a set of representative data characteristics and recommendation accuracy. This analysis provides new insights into which data characteristics play a more (or less) important role and, as a result, can help system designers improve the performance of recommender systems. Using the aforementioned metrics (i.e., rating density, movie skewness, user skewness, movie Gini, user Gini, and rating variance) that, we believe, can play an important role in explaining variations in recommendation accuracy, we build the following explanatory model, the analysis of which is discussed in the next section:

  Recommendation Accuracy = β0 + β1·Density + β2·movieSkewness + β3·userSkewness + β4·movieGini + β5·userGini + β6·Variance + ε.

3. Experimental Results

Dataset and Sampling Procedure. We used the publicly available MovieLens 1M movie rating dataset (available at movielens.org) to examine the relationship between data characteristics and the accuracy of popular recommendation algorithms. The dataset consists of 1 million ratings provided by 6040 users for 3952 movies (i.e., data density is 4.2%).
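To make the modeling setup concrete, here is a hedged sketch on purely synthetic data (the coefficient values, variable ranges, and noise level are invented for illustration and are not the paper's estimates): an ordinary least squares fit of RMSE on two mean-centered characteristics.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Synthetic data characteristics (hypothetical ranges).
density = rng.uniform(0.06, 0.5, n)
variance = rng.uniform(0.5, 2.0, n)
# Hypothetical "true" relationship: denser data lowers RMSE,
# higher rating variance raises it, plus a little noise.
rmse_dv = (0.95
           - 0.30 * (density - density.mean())
           + 0.15 * (variance - variance.mean())
           + rng.normal(0.0, 0.01, n))

# Mean-center the IVs and fit by ordinary least squares.
X = np.column_stack([np.ones(n),
                     density - density.mean(),
                     variance - variance.mean()])
beta, *_ = np.linalg.lstsq(X, rmse_dv, rcond=None)
resid = rmse_dv - X @ beta
r2 = 1.0 - resid @ resid / np.sum((rmse_dv - rmse_dv.mean()) ** 2)
print(beta, r2)
```

Because the regressors are mean-centered, the fitted intercept beta[0] equals the sample mean of the dependent variable, which is why the constant in such a model can be read as the expected RMSE on a dataset with average characteristics.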
All ratings are integers between 1 and 5. To create datasets with different data characteristics, we use a window sampling procedure to extract a variety of samples from the original dataset. For data preparation, ratings are read into a rating matrix, with each column representing an item and each row representing a user. Rows and columns of the rating matrix are then rearranged according to the frequency distribution, i.e., the first column represents the most-rated movie, the first row represents the user who provided the most ratings, and so on. As illustrated by Figs. 2a and 2b, after rearranging rows and columns the new rating matrix is far more unevenly distributed than the original, i.e., the data is highly dense in one corner and highly sparse in the opposite one. A rectangular window of fixed size is then moved around the sorted rating matrix, and the ratings that fall within the boundaries of the window are extracted as a sample. Fig. 2c visualizes the window and some of the samples obtained from the sorted rating matrix of the MovieLens 1M dataset. In our experiments, the size of the window was set to 300 users by 200 items. The step size of each window move was set proportional to the rating density, so that the window moves more slowly where the data is dense and faster where it is sparse. We adjust the movement speed because the majority of the original matrix is very sparse and, hence, we need to extract enough samples from the denser area to ensure the richness of the sample representation. In addition, to ensure that each sample has enough data for recommendation algorithms to make meaningful predictions, only samples with rating density above 6% are considered in our experiments. This specific threshold ensures that all recommendation algorithm versions tested in our experiments have sufficient prediction coverage. (Experiments with lower thresholds produced comparable results in the follow-up study.) In total, we extracted 1384 samples exhibiting varying characteristics in terms of rating density, distribution, and value.

Figure 2. (a) Original rating matrix, (b) rearranged matrix, and (c) window sampling illustration.

Recommendation Algorithms. In this paper we focus on two of the most widely used collaborative filtering (CF) techniques for recommender systems, neighborhood-based and matrix factorization CF approaches, to test the general validity of the proposed premise. A neighborhood-based CF approach predicts the unknown ratings of a user from the ratings of nearest-neighbor users with similar rating patterns [2]. This technique can be user-based (as described above) or item-based, where the ratings of the nearest-neighbor items are used to predict the unknown ratings of an item [3]. The matrix factorization technique estimates each unknown rating as the inner product of a user-factor and an item-factor vector, which represent, respectively, the user's preference for several latent features and the item's importance weights on the same features [5,10].
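A simplified sketch of the window sampling idea (unit steps and a toy density threshold here, whereas the paper uses a 300-by-200 window with density-proportional steps; those specifics are simplified in this sketch):

```python
import numpy as np

def window_samples(R, win_rows, win_cols, min_density):
    """Sort the rating matrix so the most active users and most rated
    items come first, then slide a fixed-size window (unit steps here)
    and keep the sub-matrices whose rating density is high enough."""
    known = ~np.isnan(R)
    row_order = np.argsort(-known.sum(axis=1))  # most active users first
    col_order = np.argsort(-known.sum(axis=0))  # most rated items first
    S = R[np.ix_(row_order, col_order)]
    samples = []
    for r in range(S.shape[0] - win_rows + 1):
        for c in range(S.shape[1] - win_cols + 1):
            window = S[r:r + win_rows, c:c + win_cols]
            if np.mean(~np.isnan(window)) >= min_density:
                samples.append(window)
    return samples

rng = np.random.default_rng(42)
R = rng.uniform(1, 5, (20, 30))
R[rng.random((20, 30)) < 0.7] = np.nan   # make the matrix ~70% sparse
R[:6, :6] = 3.0                          # a dense user/item block
samples = window_samples(R, win_rows=5, win_cols=5, min_density=0.4)
print(len(samples))
```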
Many variations of matrix factorization techniques were developed during the Netflix Prize competition and have proven effective in terms of recommendation accuracy. The basic version of the matrix factorization technique [5] as well as both user- and item-based neighborhood CF approaches were used in our experiments. Our experiments follow the standard process of 5-fold cross-validation and test the impact of dataset characteristics on the prediction accuracy of the three recommendation techniques mentioned above. Recommendation accuracy was measured by the standard root mean squared error (RMSE) [8].

Regression Analysis. As discussed earlier, we chose 6 independent variables (IVs): rating density, the Gini coefficients and skewness indices of the user and item rating frequency distributions, and rating variance. The dependent variable (DV) is the prediction error of each of the three recommendation algorithms, as measured by RMSE. A linear regression model was built to explain the relationship between the six IVs and the DV for each algorithm. All IVs were centered by subtracting their means; this zero-mean centering does not affect the relationships between variables, but it makes the regression models easier to interpret. A thorough examination of the correlations among the IVs did not raise collinearity issues. The regression results for the three recommendation techniques are summarized in Table 1. The six data characteristics explain 80.1%, 75.7%, and 79.7% of the variance in recommendation RMSE for the item-based CF, user-based CF, and matrix factorization techniques, respectively. Hence, we conclude that data characteristics play a significant role in determining the accuracy of recommendation algorithms.

Table 1. Summary of Regression Results

                   Item-based CF          User-based CF          Matrix Factorization
R² (Adjusted R²)   .801 (.800)            .757 (.756)            .797 (.797)
                   coeff.       t         coeff.       t         coeff.       t
Constant            .956    1306.35 ****   .967    1011.54 ****   .940    2150.63 ****
Density            -.337     -49.33 ****  -.379     -42.48 ****  -.180     -44.09 ****
Variance            .158      13.89 ****   .207      13.90 ****   .135      19.81 ****
movieGini           .134       5.16 ****   .056       1.65        .034       2.23 *
userGini            .202       7.46 ****   .192       5.43 ****   .070       4.33 ****
movieSkewness      -.031      -9.58 ****  -.031      -7.31 ****  -.020      -9.97 ****
userSkewness       -.035     -13.01 ****  -.035     -10.13 ****  -.020     -12.83 ****
**** p ≤ 0.0001, *** p ≤ 0.001, ** p ≤ 0.01, * p ≤ 0.05

Since the IVs are centered to have zero means, the constant in each regression model represents the expected recommendation accuracy (measured in RMSE) of the given technique on a dataset with average rating density, variance, and frequency distribution. The results suggest that, for a dataset with average characteristics, matrix factorization has the smallest expected RMSE (0.940), followed by item-based CF (0.956), while user-based CF has the largest expected RMSE (0.967). This confirms findings in the recommender systems literature that matrix factorization typically yields better accuracy than neighborhood-based collaborative filtering approaches [10], and that item-based techniques often outperform user-based ones [3]. Overall, the relationships between the six IVs and RMSE are significant, and the signs of the regression coefficients are consistent across the three models. For example, rating density has a negative effect on RMSE for all three techniques (-0.337, -0.379, and -0.180, respectively), i.e., a denser dataset leads to better accuracy.
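The RMSE measure that serves as the dependent variable is simply the root of the mean squared prediction error over held-out ratings; a minimal version (not the paper's evaluation harness):

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between known ratings and predictions."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((a - p) ** 2)))

# Toy example: four held-out ratings vs. model predictions.
print(rmse([4, 3, 5, 2], [3.8, 3.4, 4.5, 2.3]))  # ~0.367
```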
Comparing the regression coefficients of density, matrix factorization appears less sensitive to data density than the neighborhood-based CF approaches. As another example, rating variance has a positive impact on RMSE (0.158, 0.207, and 0.135), meaning that a dataset with smaller variation in rating values tends to yield better accuracy.

We also examined the relative importance of the six IVs using stepwise regression. The sequence in which variables are added to the regression model serves as a proxy for their rank contribution in explaining variations in RMSE. Stepwise regression results are provided in Table 2. For all three algorithms, dataset density is the most important variable: including density alone explains 69.6%, 62.5%, and 56.7% of the variance in RMSE for item-based CF, user-based CF, and matrix factorization, respectively. The next most important variable is rating variance; together with density, the two variables explain 76.8%, 73.0%, and 75.5% of the variation in RMSE for the three techniques. The frequency distribution statistics (i.e., Gini and skewness) are added to the models last, in slightly different orders, and these four distributional metrics provide further improvements to R².

Table 2. Sequence of Variables in Stepwise Regression Model

Step    Item-based CF                      User-based CF                   Matrix Factorization
        Predictors                   R²    Predictors                R²    Predictors                      R²
1       Density                     .696   Density                  .625   Density                        .567
2       Density, Variance           .768   Density, Variance        .730   Density, Variance              .755
Final   Density, Variance,          .801   Density, Variance,       .757   Density, Variance,             .797
        userSkewness, userGini,            userSkewness,                   userSkewness, movieSkewness,
        movieSkewness, movieGini           movieSkewness, userGini         userGini, movieGini

To check the robustness of our results, we ran a second set of experiments on the Netflix Prize dataset. The findings (not presented here due to space limitations) were consistent with the results reported above.

4. Discussion and Conclusion

The objective of this study is to investigate the relationship between dataset characteristics and the accuracy of popular collaborative filtering techniques. To prepare datasets with varying characteristics, we introduce a window sampling procedure that extracts samples with different rating densities, rating frequency distributions, and rating value distributions. Our experimental results show that recommendation accuracy is strongly influenced by the structural characteristics of the dataset, and that the effects of these characteristics are consistent across different recommendation techniques: using six simple variables, one can explain about 80% of the variation in RMSE. Moreover, among the six characteristics, rating density is the most important variable in explaining variation in recommendation accuracy, followed by rating variance and then the descriptive statistics of the rating frequency distribution. In terms of practical implications, by analyzing customers' usage and preference patterns and understanding the evolution of rating dataset characteristics, firms would be able to anticipate the performance changes of their systems.
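A hedged sketch of the forward stepwise idea on synthetic data (simplified: real stepwise procedures use F-test entry/exit criteria rather than a raw R² gain cutoff, and the variable names and coefficients here are invented for illustration):

```python
import numpy as np

def r_squared(X_cols, y):
    """R^2 of an OLS fit of y on an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + X_cols)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def forward_stepwise(X, y, names, min_gain=1e-3):
    """Greedily add the predictor that most improves R^2; stop when the
    best available improvement falls below min_gain."""
    chosen, order, best = [], [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = [(r_squared([X[:, j] for j in chosen + [c]], y), c)
                  for c in remaining]
        top_r2, top_c = max(scores)
        if top_r2 - best < min_gain:
            break
        chosen.append(top_c)
        remaining.remove(top_c)
        best = top_r2
        order.append((names[top_c], round(top_r2, 3)))
    return order

rng = np.random.default_rng(1)
n = 300
density = rng.normal(size=n)
variance = rng.normal(size=n)
irrelevant = rng.normal(size=n)
# Strong density effect, weaker variance effect, one noise variable.
y = 1.0 - 0.5 * density + 0.2 * variance + 0.05 * rng.normal(size=n)
X = np.column_stack([density, variance, irrelevant])
print(forward_stepwise(X, y, ["density", "variance", "irrelevant"]))
```

As in the paper's Table 2, the strongest predictor enters first and the entry order ranks the variables' explanatory contributions.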
Further, this allows firms to strategically direct their efforts (e.g., by designing user interfaces that promote certain types of user-system interactions) toward more favorable data distributions in order to maximize the performance of their recommender systems. We believe that the impact of dataset characteristics on the accuracy of recommendation algorithms deserves attention and exploration in both research and practice, and that significant additional work is needed to explore this issue more comprehensively.

Acknowledgement

This research was supported in part by National Science Foundation grant IIS-0546443.

References

[1] G. Adomavicius and A. Tuzhilin, Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions, IEEE Trans. on Knowledge and Data Engineering, 17(6), pp. 734-749, 2005.
[2] J.S. Breese, D. Heckerman, and C. Kadie, Empirical analysis of predictive algorithms for collaborative filtering, In Proc. of the 14th Annual Conf. on Uncertainty in Artificial Intelligence, pp. 43-52, 1998.
[3] M. Deshpande and G. Karypis, Item-based top-N recommendation algorithms, ACM Transactions on Information Systems, 22(1), pp. 143-177, 2004.
[4] L.J. Flynn, Like This? You'll Hate That. New York Times, Jan 23, 2006.
[5] S. Funk, Netflix Update: Try This at Home, Available at: http://sifter.org/~simon/journal/20061211.html, 2006.
[6] C. Gini, Measurement of Inequality and Incomes, The Economic Journal, 31, pp. 124-126, 1921.
[7] J.L. Herlocker, J.A. Konstan, and J. Riedl, Explaining Collaborative Filtering Recommendations, In Proc. of the 2000 ACM Conference on Computer Supported Cooperative Work, pp. 241-250, 2000.
[8] J.L. Herlocker, J.A. Konstan, L.G. Terveen, and J. Riedl, Evaluating collaborative filtering recommender systems, ACM Transactions on Information Systems, 22(1), pp. 5-53, 2004.
[9] Z. Huang, H. Chen, and D. Zeng, Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering, ACM Transactions on Information Systems, 22(1), pp. 116-142, 2004.
[10] Y. Koren, R. Bell, and C. Volinsky, Matrix Factorization Techniques for Recommender Systems, IEEE Computer, 42(8), pp. 30-37, 2009.
[11] J. Martin, G. Barron, and M.I. Norton, Choosing to Be Uncertain: Preferences for High Variance Experiences. Working Paper, Harvard Business School, 2007.
[12] J. Roettgers, Warner Bros.-Netflix Deal is All About the Long Tail, http://newteevee.com/2010/01/08/warnerbros-netflix-deal-is-all-about-the-long-tail, Jan 2010.
[13] M. Sun, How Does Variance of Product Ratings Matter?, http://ssrn.com/abstract=1400173, 2010.