Prediction of Citations for Academic Papers


Utkarsh Simha, Sindhura Raghavan, Sai Spoorthy Padigi
University of California, San Diego

Abstract

We approach the problem of literature search for an unpublished academic paper through citation prediction. Given a pair of academic papers, we predict whether one paper will cite the other. We explore the characteristics of a subset of the DBLP dataset, a computer science bibliography that provides a comprehensive list of research papers in computer science. We examine how features of a pair of papers, such as author similarity, popularity of the publication conference, and co-citation, correlate with the classification label. We experiment with models such as Logistic Regression and a Multilayer Perceptron for the classification task.

1 Introduction

The process of finding research literature and related work can be simplified by predicting citations from a current paper, based on the paper's abstract and content. To achieve this, we model paper citation as the task of predicting whether a one-directional citation exists in a pair of academic papers. First, we analyze the data in an exploratory fashion to understand the available features of each publication, the scope of each feature, and the relevance of these features to each other. We then use this information to explore the task of predicting whether one paper would cite another, given a pair of papers. For this prediction task, we further explore relevant features that could affect the probability of a citation between two papers, that is, whether a paper A would cite a paper B.
We investigate the correlation of the label with features such as the years of publication of both papers, the author similarity of the two papers, citation counts for the conference venues, and the co-citation count of the two papers. For the prediction task, we begin by using logistic regression for classification based on a simple feature set. We then extend the same model with a feature set built after exploring and analyzing the dataset for correlations between two papers. Lastly, we draw conclusions from the results of our experiments and explain why our improvements and feature selection worked.

1.1 Notation

We write $p_i \to p_j$ for a paper $p_i$ citing a paper $p_j$. Henceforth, each such pair of papers is denoted $(p_i, p_j)$ unless mentioned otherwise. We use set notation for in-cite and out-cite sets. The in-cite set of a paper $p_j$ is $\{p_k \mid p_k \to p_j\}$, that is, the set of papers $p_k$ that cite $p_j$. Similarly, the out-cite set of a paper $p_i$ is $\{p_k \mid p_i \to p_k\}$, that is, the set of papers $p_k$ that are cited by $p_i$.
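As a minimal sketch of the notation above, the in-cite and out-cite sets can be built in one pass over the reference lists. The input layout (a dictionary from paper ID to the IDs it references) and the function name `build_cite_sets` are assumptions made for illustration, not the format or code used in this work.

```python
from collections import defaultdict

def build_cite_sets(references):
    """Build in-cite and out-cite sets from reference lists.

    references: dict mapping a paper id p_i to an iterable of paper ids
    that p_i cites (assumed layout, for illustration only).
    """
    out_cites = defaultdict(set)  # out_cites[p_i] = {p_k | p_i -> p_k}
    in_cites = defaultdict(set)   # in_cites[p_j]  = {p_k | p_k -> p_j}
    for p_i, cited in references.items():
        for p_j in cited:
            out_cites[p_i].add(p_j)
            in_cites[p_j].add(p_i)
    return in_cites, out_cites

# Hypothetical toy citation data: paper 1 cites papers 2 and 3, and so on.
references = {1: [2, 3], 2: [3], 4: [3, 1]}
in_cites, out_cites = build_cite_sets(references)
print(sorted(in_cites[3]))   # papers that cite paper 3
print(sorted(out_cites[4]))  # papers cited by paper 4
```

Storing both directions up front keeps later set-based features (such as the co-citation score) to simple set intersections and unions.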

2 Exploratory Analysis

In this section, we examine the characteristics of the data and the various features of the dataset as a whole.

2.1 The Dataset

The DBLP computer science bibliography contains the metadata of over 3.6 million publications, written by over 1 million authors in several thousand journals and conference proceedings series. The dataset [1] has all important journals of computer science since [year missing in source]. Each paper contains the following metadata fields:

1. ID: A unique integer identifier for the paper
2. Title: The title of the paper
3. Abstract: The abstract of the paper
4. Authors: The list of authors of the paper
5. Conference: The conference venue at which the paper was presented
6. References: The list of papers that the paper cites as references
7. Year: The year the paper was published

We observed that many papers had no abstract. Hence, we restricted our dataset to the papers that had an abstract. Next, we obtained a dataset for our prediction task by creating 600K pairs of papers of the form $(p_i, p_j)$ from the existing dataset. For each pair, we assigned the label 1 if $p_i$ cites $p_j$ and 0 if it does not.

2.2 Features

We gather features from the metadata of each paper pair. A good feature is one that helps us predict whether $p_i \to p_j$. Features that can be obtained from the metadata are broadly either text-based or citation-statistics-based. Semantic similarity between the abstracts and the TF-IDF representations of the abstracts model text-based correlations, while the citation-statistics-based features capture statistics of citations with respect to the authors of both papers and the conferences in which they are published. Neither text-based nor citation-based features performed well in isolation, as explored in Strohman et al. [2]. The following is the list of features we modeled for our prediction task for a pair of papers:

1. Author similarity: The Jaccard similarity between the author sets of papers $p_i$ and $p_j$:
   $$\frac{|a_i \cap a_j|}{|a_i \cup a_j|}$$
   where $a_i$ and $a_j$ are the sets of authors of papers $p_i$ and $p_j$ respectively.
2. Author history: The number of times the authors of paper $p_i$ have cited paper $p_j$ through some other paper $p_k$:
   $$|\{p_k \mid p_k \to p_j\}|$$
   where $p_i$ and $p_k$ have been authored by the same author.
3. Venue pair citations: The number of times a paper at venue $v_i$ cites a paper at venue $v_j$, where $p_i \in v_i$ and $p_j \in v_j$:
   $$|\{p_i \mid p_i \to p_j,\ p_i \in v_i,\ p_j \in v_j\}|$$
4. Co-citation score: The Jaccard similarity between the set of papers that cite $p_i$ and the set of papers that cite $p_j$. Let $r_i = \{p_k \mid p_k \to p_i\}$ be the in-cite set of $p_i$, and similarly let $r_j = \{p_k \mid p_k \to p_j\}$. The co-citation score is then:
   $$\frac{|r_i \cap r_j|}{|r_i \cup r_j|}$$
5. Abstract vector: 200-dimensional TF-IDF vectors of the abstracts of the papers

6. Abstract similarity: The cosine similarity between the TF-IDF vectors of the abstracts of the two papers. The intuition is that if two papers belong to a similar topic, the chances are higher that paper $p_i$ cites paper $p_j$.
7. Year 1: Year of publication of paper $p_i$
8. Year 2: Year of publication of paper $p_j$

2.3 Feature Selection

Table 1: Statistics of Features (columns: Min, Max, Mean; rows: Abstract Similarity, Author similarity, Co-citation score, Venue pair citations, Author history, Year 1, Year 2; values missing in source)

Figure 1: Histogram plots of Features.

Table 2 provides the Pearson correlation between each chosen feature and the label, while Table 3 provides the Pearson correlation between the features and the co-citation count. The higher the absolute value of the Pearson correlation coefficient, the stronger the linear association between a feature and the output. From the Pearson correlation coefficients and the plots, we observe that author, venue, and abstract related features are highly correlated with the co-citation count as well as the final labels, which helps in selecting features for citation prediction. The co-citation count can be used to populate a matrix of ground-truth values for all pairs of papers. This matrix can be used to train a Matrix Factorization model, while the features for paper pairs can be used for context-aware recommender systems such as Factorization Machines [4].
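The Jaccard-based features of Section 2.2 and the Pearson correlation used for feature selection can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the experiments. The toy author names, citing-paper IDs, feature values, and the convention of returning 0 for empty sets or constant inputs are all assumptions.

```python
import math

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are scored as 0
    return len(a & b) / len(union)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: undefined correlation is reported as 0
    return cov / math.sqrt(var_x * var_y)

# Author similarity: hypothetical author sets a_i and a_j.
authors_i = {"A. Author", "B. Author", "C. Author"}
authors_j = {"B. Author", "D. Author"}
author_similarity = jaccard(authors_i, authors_j)  # 1 shared of 4 distinct

# Co-citation score: hypothetical in-cite sets r_i and r_j.
r_i = {10, 11, 12}
r_j = {11, 12, 13, 14}
co_citation_score = jaccard(r_i, r_j)  # 2 shared of 5 distinct

# Correlation of one feature column with the labels (toy values).
author_history = [0.0, 1.0, 0.0, 3.0, 2.0]
labels = [0, 1, 0, 1, 1]
r = pearson(author_history, labels)
print(author_similarity, co_citation_score, round(r, 3))
```

The same `jaccard` helper serves both the author similarity and the co-citation score, since the two features differ only in which sets are compared.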


Figure 2: Plot of number of citations across venues.

Table 2: Pearson Correlation between features and labels

Feature                 Pearson correlation coefficient
Abstract Similarity     …424534
Author similarity       0.265333
Co-citation score       0.222043
Venue pair citations    0.284789
Author history          0.613941
Year 1                  0.418595
Year 2                  8.…e-06

3 Identifying the Predictive Task

The task of citation prediction for literature search can be modeled in various ways: predicting the citation count of a paper, predicting the co-citation count of a pair of papers, or predicting whether one paper cites another and using this information to rank papers that could be cited.

3.1 Co-citation prediction

We initially explored the task of predicting the co-citation count for a pair of papers $\{(p_i, p_j) \mid p_i \to p_j\}$, using the selected features as the feature space and the co-citation count as the ground truth. We experimented with regression models such as Gradient Boosted Regression and Linear Regression, and with a recommendation algorithm, the Factorization Machine.

Prediction models for co-citation prediction

1. Linear Regression: Linear regression models the relationship between a scalar dependent variable $y$ and one or more explanatory variables (features) denoted by $X$. For this task, the features presented in Section 2.2 form $X$, and the co-citation score between papers $p_i$ and $p_j$ is the variable $y$. The model equation is:
   $$y = \theta^T X \quad (1)$$
2. Factorization Machines: FMs learn feature-pair-wise interaction weights along with global and feature-wise parameters.

Table 3: Pearson Correlation between features and co-citation count (rows: Author similarity, Author history, Venue pair citations, Abstract Similarity, Year 1, Year 2; values missing in source)

Figure 3: Scatter plots of Features vs. Co-citation count.

The model equation is:

$$\hat{y} = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j \quad (2)$$

where $w_0 \in \mathbb{R}$ is the global bias, $w \in \mathbb{R}^n$ holds the weight of the $i$-th feature, and row $v_i$ of $V \in \mathbb{R}^{n \times k}$ describes the $i$-th variable with $k$ factors. L2 regularization hyper-parameters for $w$ and $V$ can be included and tuned, along with the standard deviation used to initialize the parameters. The best MSE was obtained at init_stdev = 0.1, l2_reg_w = 0.01 and l2_reg_V = [value missing in source].
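To make equation (2) concrete, the following sketch evaluates the Factorization Machine prediction for a single feature vector, once directly from the double sum and once with the equivalent linear-time reformulation given by Rendle [4]. The parameter values are hypothetical toy numbers, not trained values, and the function names are chosen for illustration only.

```python
def fm_predict(x, w0, w, V):
    """Evaluate equation (2) directly: bias + linear terms + pairwise terms."""
    n = len(x)
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # <v_i, v_j>: dot product of the factor rows of features i and j
            dot = sum(vi * vj for vi, vj in zip(V[i], V[j]))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise

def fm_predict_fast(x, w0, w, V):
    """Same prediction in O(n k) time using the identity
    sum_{i<j} <v_i, v_j> x_i x_j
      = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]."""
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise

# Hypothetical parameters for n = 3 features and k = 2 factors.
x = [1.0, 2.0, 3.0]
w0 = 0.5
w = [0.1, 0.2, 0.3]
V = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
y_hat = fm_predict(x, w0, w, V)
y_hat_fast = fm_predict_fast(x, w0, w, V)
print(y_hat, y_hat_fast)
```

The direct form mirrors the equation term by term, while the reformulated version is the one that scales to many features.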

Evaluation Metric

The evaluation metric used for this task was Mean Squared Error (MSE), since this is a regression task:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad (3)$$

Table 4: Evaluation of Co-citation prediction (columns: Model, Train MSE, Test MSE; rows: Linear Regression, Gradient Boosted Regression, FM; values missing in source)

3.2 Citation prediction

In this prediction task, we aim to predict whether a paper $p_i$ cites a paper $p_j$, given a pair of papers $(p_i, p_j)$. This task is explained in Section 5.

4 Literature Survey

4.1 Datasets

1. CiteSeer: One of the most widely used datasets for academic citation metadata is CiteSeer. Although the dataset is free to use, accessing it requires permission from Penn State University, so we could not obtain it. We tried writing a scraper for the CiteSeer website, but the site limits scraping to 50 pages, which was insufficient for us.
2. Microsoft Academic Graph: The MAG dataset is another widely used dataset for citation prediction and recommendation. The data can be accessed through an API which provides 10,000 free calls per month. As the query format was very complex and would not cater to our task, we decided not to use it.
3. arXiv: The arXiv dataset consisted of papers in High Energy Physics and was released as part of a KDD Cup contest. Unfortunately, the dataset was limited to 30,000 papers and thus was insufficient for our task.
4. NIPS: One of the datasets on Kaggle was NIPS papers, mostly in the field of Machine Learning and Deep Learning. The dataset contained only 8000 samples and thus was insufficient for our task.

4.2 Features

1. h-index: The h-index measures both the productivity and the impact of an author's published work. It is the largest value of h such that h papers of the author have each received at least h citations.
2. Content based: The novelty, diversity, popularity and quality of a paper are assessed using the topic distribution of its contents.
3. Temporal statistics: Statistics collected about the author and paper over a recent period of time can be used to quantify trends over time.
4. Citation graph: Citation graph clustering and mining, along with measures such as the Katz distance (to determine the relevance of a publication within a group of clustered papers), can be used to infer characteristics of a similar set of publications.
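The h-index definition in item 1 above can be illustrated with a short sketch: sort the citation counts in decreasing order and find the last rank at which the count is still at least the rank. The function name and the example citation counts are hypothetical.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank   # the top `rank` papers all have >= rank citations
        else:
            break      # counts only decrease from here, so stop
    return h

# Hypothetical citation counts for one author's papers.
example = [10, 8, 5, 4, 3]
print(h_index(example))
```

Here four papers have at least four citations each, but there are not five papers with at least five, so the h-index of the example is 4.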

Other predictive tasks

1. h-index: Predicting the h-index is indirectly a predictive measure of the impact of a publication.
2. Citation count: Predicting the citation count of a paper and its change over time is another predictive task.
3. Citation recommendation: Recommending papers as citations given a paper's title, abstract, venue of publication and possibly contents.

5 Model

5.1 Training

We divided the dataset of 600K samples into training data and test data with a 60%-40% split¹. The data was shuffled. The label to predict is 1 if paper $p_i$ cites paper $p_j$, and 0 otherwise. At first, we used all features other than those pertaining to author similarity and author history. This yielded a test accuracy of 86%. Upon adding the author features, the Logistic Regression model yielded an accuracy of 99.8%. Although this might seem very high and possibly a miscalculation, we verified these results by checking the R-score, precision and recall. We also verified that the test predictions matched the test labels. To further confirm this, we trained different models on the data; their performance is given below.

5.2 Model Description

We used three models for evaluation.

Model                           Test accuracy
Logistic Regression             99.8%
Deep Neural Network             99.7%
Gradient Boosting Regression    99.9%

1. Logistic Regression: A simple logistic regression model was used. The predictions were converted to 1 or 0 by thresholding.
2. Deep Neural Network: A two-layer deep neural network with 256 hidden neurons and a dropout factor of 0.5 was trained using the Adagrad optimizer.
3. Gradient Boosting Regressor: A simple Gradient Boosting Regressor with decision tree stumps was used, with 300 estimators and a max depth of 2 per tree.

Each of these models performed equally well on the test set. Logistic Regression took the least time to train, while the Gradient Boosting Regressor took the longest.
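As an illustration of the first model, the sketch below fits a logistic regression by batch gradient descent and thresholds the predicted probabilities at 0.5. It is a self-contained toy in plain Python, not the implementation used for the reported results. The learning rate, number of epochs, threshold, and the two-feature toy pairs are all assumptions.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Gradient of the log loss w.r.t. the logit is (prediction - label).
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(X, w, b, threshold=0.5):
    """Convert predicted probabilities to 0/1 labels by thresholding."""
    probs = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]
    return [1 if p >= threshold else 0 for p in probs]

# Hypothetical toy pairs: [author_history, abstract_similarity].
X = [[0.0, 0.1], [0.0, 0.2], [1.0, 0.3], [2.0, 0.8], [3.0, 0.6], [0.0, 0.05]]
y = [0, 0, 1, 1, 1, 0]
w, b = train_logistic(X, y)
preds = predict(X, w, b)
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
print(preds, accuracy)
```

In this toy data the author history feature alone separates the classes, which loosely mirrors the jump in accuracy reported above once the author features are added.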
6 Scope for future work

This model can be extended to recommendation. Top-k candidate papers can be selected using a naive method such as k-means clustering or title similarity. The above model can then predict a score to rank these candidates and recommend the top n, where $n \le k$. This can be further extended to predict the number of citations a paper would get in the next n years.

¹ Train size: 360,000; test size: 240,000.
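The ranking step proposed in Section 6 could look like the following sketch: given k candidate papers and a scoring function (for example, the predicted citation probability from the classifier), keep the n highest-scoring candidates. The candidate names, the scores, and the helper `recommend_top_n` are hypothetical.

```python
import heapq

def recommend_top_n(candidates, score_fn, n):
    """Return the n highest-scoring candidates, best first."""
    return heapq.nlargest(n, candidates, key=score_fn)

# Hypothetical scores for k = 4 candidate papers, e.g. predicted
# probabilities that the query paper cites each candidate.
scores = {"paper_a": 0.91, "paper_b": 0.15, "paper_c": 0.78, "paper_d": 0.40}
top = recommend_top_n(list(scores), scores.get, n=2)
print(top)
```

Using a heap-based selection avoids sorting all k candidates when only a small n is needed.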

References

[1] The entire DBLP dataset of bibliography entries in XML format.
[2] T. Strohman, W. Bruce Croft, D. Jensen. Recommending Citations for Academic Papers. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR07).
[3] R. Yan, J. Tang, X. Liu, D. Shan. Citation Count Prediction: Learning to Estimate Future Citations for Literature.
[4] S. Rendle. Factorization Machines.
