Supplementary Information - On the Predictability of Future Impact in Science

Size: px

Start display at page:

Download "Supplementary Information - On the Predictability of Future Impact in Science"

Martin Neal
5 years ago
Views:

1 Supplementary Information - On the Predictability of uture Impact in Science Orion Penner, Raj K. Pan, lexander M. Petersen, Kimmo Kaski, and Santo ortunato Laboratory of Innovation Management and conomics, IMT Institute for dvanced Studies Lucca, 55 Lucca, Italy epartment of iomedical ngineering and omputational Science, alto University School of Science, P.O. ox, I-76, inland Laboratory for the nalysis of omplex conomic Systems, IMT Institute for dvanced Studies Lucca, 55 Lucca, Italy S. MTHOS. isambiguation strategy The disambiguation problem is a major hurdle in the analysis of careers, as multiple authors having the same initials, and even the same complete name, can appear as a single author. Here we use disambiguated distinct author data from Thomson Reuters Web of Knowledge, isiknowledge.com using their matching algorithms to identify publication profiles of distinct authors. urther, we use its website portal ResearcherI.com, where users upload and maintain their publication profiles. This ISI online database is host to comprehensive data that is well-suited for developing testable models for scientific impact [, ] and career achievement [, ].. Selection of scientists We seek to compare variations in productivity and impact across distinct scientific fields as well as within fields. To do this we analyze a total 76 scientists divided into broad disciplines: 76, 6 biologists, and 5 mathematicians. ataset : or the selection of high-impact, we aggregate all authors who published in Physical Review Letters (PRL) over the 5-year period 95- into a common dataset. rom this dataset, we rank the scientists using the citations shares metric defined in [], and select the top. Such metric divides equally the total number of citations a paper receives among the k coauthors, and also normalizes the total number of citations by a time-dependent factor to account for citation variations across time and discipline. Hence, for each scientist in the PRL database, we calculate a cumulative number of citation shares received from only their PRL publications. This tally serves as a proxy for his/her scientific impact in all journals. We also choose from our ranked PRL list, randomly, additional highly prolific. The selection criteria for that dataset is that an author must have published between and 5 papers in PRL []. This likely ensures that the total publication [] uthors contributed equally. history, in all journals, be on the order of articles for each author selected. These two lists were curated in such a fashion in a previous publications, and as both of them represent prominent, are merged and analyzed together for some of the analysis. The average h-index of these scientists is h = 56 ±. ataset : or the selection of high-impact cell biologists we choose the top careers based on publications in the journal LL. These scientist s have average h- index of h = 97 ±. ataset : or the selection of high-impact mathematicians we selected the 5 authors with the most publications in the prestigious journal nnals of Mathematics. The average h-index of these scientist is h = ±. We choose only 5 since the variation in collaboration and productivity across mathematics is significantly smaller than in the experimental and theoretical natural sciences. The above three datasets consist of high-impact senior scientists with average academic age ±, ± 7 and 6 ±, respectively. ataset : We also consider relatively young assistant professors from physics. To select the scientists in this dataset, we choose two assistant professors from each of the top 5 U.S. physics and astronomy departments ranked according to the magazine U.S. News. The average h-index of these scientists is h = ± 7. or datasets []-[] we used the istinct uthor Sets function provided by ISI in order to increase the likelihood that only papers published by each given author are analyzed. On a case by case basis, we performed further author disambiguation for each author. Other datasets are comprised of a broad range of scientists with profiles on ResearcherI.com who satisfied the criterion of having more than publications. ataset : This dataset consists of 7 scientists who have published in the field of graphene research. dditionally, we also include notable leaders of this field (Nobel Prize Laureates. K. Geim and K. S. Novoselov). The average h-index of this group is h = ±. ataset : This dataset consists of 6 molecular biology and 76 neuroscience ResearcherI scientists. We assume that such ResearcherI scientists have uploaded a representative (approximately update and complete) set of publications. The average h-index of the scientists in this group is h = 7 ±. The last three datasets consist of relatively young scientists with average academic age ±6, ±7 and 7±9,

2 ..5. h-index.5 academic age biologists. 5 5 R Top Prolific t=ll t=7 t= igure S: Variation of the standard coefficient indicating the change in the contribution of each factor over time for different disciplines. The shaded region indicates the 95% confidence error bars. The standard coefficients are shown for t =ll case, where all career ages were lumped together. respectively. In summary, we group the 76 scientists that we analyze into 6 sets. We downloaded datasets in Jan., and in pr., in Oct., and, in October. biologists 5 5 t=7 h-index biologists 5 5 h-index igure S: Variation of the standard coefficient indicating the change in the contribution of each factor over time for different disciplines. The shaded region indicates the 95% confidence error bars. The standard coefficients are shown for different age cohorts t =, 7. S. LSTI NT RGULRIZTION Regression analysis is a statistical technique for estimating the relationships among the dependent and independent variables. If X represent the independent vari- igure S: The predictive power of the regression model of the h-index for physicist for different career age cohorts (years since first publication t =, 5, 7), with () top and () prolific plotted separately. In the curve for t =ll, all career ages were lumped together. or all the cases, overall regression model is significant (p < 6, calculated from -statistic). The first group show slightly larger predictive power as compared to the second group, especially for large t. able and y represent the dependent variable then the regression model relates these two variables as y = f(x, w), (S) where w are the unknown parameters. If the dependent variable is expected to be a linear combination of the independent variables, then in mathematical notation, the predicted value is expressed as ŷ = w + w x + + w p x p (S) Here, w is the intercept and (w,..., w p ) are the coefficients of the model. In general one can use an ordinary least square method to fit a linear model with coefficients to minimize the residual sum of squares between the observed responses in the dataset, and the responses predicted by the linear approximation. Thus, it can be represented as min w Xw y. (S) However, coefficient estimates for ordinary least squares rely on the independence of the model terms. When terms are correlated (also termed as collinear) this method becomes highly sensitive to random errors in the observed response, producing a large variance. urther if the number of features is large, it is possible to reduce the complexity of the model by forcing some coefficients to be small or zero. The elastic net regularization does this by imposing preferred solutions with fewer parameter values, effectively reducing the number of variables upon which the given solution is dependent. This method can be mathematically represented as min w n Xw y + λα w + λ( α) w (S)

3 Here,... and... are the L and L norm of the vector respectively. λ is a complexity parameter that controls the amount of shrinkage: the larger the value of λ, the greater the amount of shrinkage and thus the coefficients become more robust to collinearity. The parameter α controls how much collinearity is expected between features. In all our analysis we set α =., whereas the best λ is determined by cross-validation. We also checked that our results are qualitatively similar for other alpha values, say α =.. S. NULL MOLS N RNOMIZ RRS We used two different shuffling methods to create a set of randomized career profiles that do no have any inherent correlations. We use these careers as a benchmark for determining whether the predictability in the regression model is due to correlations in the scientific careers or to the career measure used. Paper shuffle model: In this null model we start by shuffling papers only within a specific career age t. To be specific, let the paper repository {p} t be the set of papers p published among all authors i =... in career year t. To distribute papers to randomized career profiles, we randomly assign papers from the set {p} t until each author has n i (t) papers, where n i (t) is the value observed in his/her real career. This method approximately retains the collaboration patterns of each scientist (and his/her sub-discipline) which are largely responsible for growth in n i (t) over the career []. We tested this shuffling method by pooling together prestigious from datasets [] and [] into a single paper repository. istinguishing papers according to year t we approximately retain the properties of older papers versus younger papers, while washing out the historical, aging, and reputation effects that make empirical career profiles author specific. Using this repository we constructed, synthetic careers profiles used in our analysis. h model: In this null model we first obtain the distribution of single year h-index increases for all careers in a given dataset. Next, we generate a career by constructing a sequence of yearly h-index increases, drawn randomly from the distribution generated in the previous step. This null model is not based on the number of papers published by a specific scientist but rather on the length of career of each scientist, which is kept fixed as in the original dataset. [] Petersen,. M., Wang,. & Stanley, H.. Methods for measuring the citations and productivity of scientists across time and discipline. Phys. Rev., 6 (). [] Petersen,. M., Stanley, H.. & Succi, S. Statistical regularities in the rank-citation profile of scientists. Scientific Reports (). [] Petersen,. M., Jung, W.-S., Yang, J.-S. & Stanley, H.. Quantitative and empirical demonstration of the matthew effect in a study of career longevity. Proc. Natl. cad. Sci. U.S.., (). [] Petersen,. M., Riccaboni, M., Stanley, H.. & Pammolli,. Persistence and uncertainty in the academic career. Proc. Natl. cad. Sci. U.S.. 9, 5 5 ().

4 ..5. h-index.5 academic age Top Prolific. 5 5 t=5 Top Prolific 5 5 h-index Top Prolific 5 5 t=7 h-index Top Prolific 5 5 h-index igure S: Variation of the standard coefficient indicating the change in the contribution of each factor over time for physicist, with top and () prolific modeled separately. The shaded region indicates the 95% confidence error bars for different career age cohorts (years since first publication t =, 5, 7) for both the groups. The spread of the coefficients within a group and the differences in the coefficients across the two groups, indicate that parameters from one group is not suitable to predict the other group...7 biologists mathematicians.6.5 R Physics assistant prof iologists Graphene researcher R igure S5: The predictive power of h-index increments h(t, t) for different disciplines and age-cohorts. or prominent, biologists and mathematicians t =,...,. s the careers of young scientists are short, for the assistant professors in physics, biologists and graphene researchers t =,...,. In all the cases, overall regression model is significant (p < ). or young researchers, as there were very few career with t >, using regression model leads to over fitting and non-significant results (p >.5).

5 5 Std. oeff. Std. oeff. Std. oeff biologists h N journals top journals t=5 t= t= t= igure S6: Variation of the standard coefficient indicating the change in the contribution of each factor over t, the period for the calculation of h. The top panels show the coefficients for early careers (t = ), the mid range careers (t = 5), and late careers (t = ). 6 5 R =.5 6 R = R = R = R =.5. R = igure S7: orrelation between the past and the true future for prominent mathematics careers. Scatter plot of the (,) number of papers, (,) number of total citations, and (,) non-cumulative h-index h(t {T j}) calculated for each author using non-intersecting sets of papers published in early - mid - and late - career periods. The correlation coefficient R for each plot is also shown. The plot also confirms that for mathematicians, fluctuations in impact measure is low and hence many data points lay on top of each other. 6 R = R =. 5 R =.5 5 R = R =. 5 6 R =.5 5 igure S: orrelation between the past and the true future for prominent biology careers. Scatter plot of the (,) number of papers, (,) number of total citations, and (,) noncumulative h-index h(t {T j}) calculated for each author using non-intersecting sets of papers published in early - mid - and late - career periods. 6 5 R = R =.6 R =. 5 5 R = R = R =. 5 5 igure S9: orrelation between the past and the true future for assistant professors in physics. Scatter plot of the (,) number of papers, (,) number of total citations, and (,) non-cumulative h-index h(t {T j}) calculated for each author using non-intersecting sets of papers published in early - mid - and late - career periods.

6 6 R =. 5 6 R =. R =.6 6 R = R = R =. 6 6 R =. 6 5 R =. R =. R = R =. 5 R =. igure S: orrelation between the past and the true future for careers in biology. Scatter plot of the (,) number of papers, (,) number of total citations, and (,) noncumulative h-index h(t {T j}) calculated for each author using non-intersecting sets of papers published in early - mid - and late - career periods. The correlation coefficient R for each plot is also shown. igure S: orrelation between the past and the true future for graphene researchers. Scatter plot of the (,) number of papers, (,) number of total citations, and (,) noncumulative h-index h(t {T j}) calculated for each author using non-intersecting sets of papers published in early - mid - and late - career periods. The correlation coefficient R for each plot is also shown.

arxiv: v1 [physics.soc-ph] 27 Aug 2013

arxiv: v1 [physics.soc-ph] 27 Aug 2013 The Z-index: A geometric representation of productivity and impact which accounts for information in the entire rank-citation profile Alexander M. Petersen 1 and Sauro Succi 2, 3 1 IMT Lucca Institute