An Index for SSRN Downloads


Zura Kakushadze [1]

Quantigic Solutions LLC, High Ridge Road, #135, Stamford, CT [2]
Free University of Tbilisi, Business School & School of Physics, 240, David Agmashenebeli Alley, Tbilisi, 0159, Georgia

September 5, 2015; revised November 13, 2015

To my mother Ludmila (Mila) Kakushadze on the occasion of her upcoming birthday

Abstract

We propose a new index to quantify SSRN downloads. Unlike the SSRN downloads rank, which is based on the total number of an author's SSRN downloads, our index also reflects the author's productivity by taking into account the download numbers for the papers. Our index is inspired by, but is not the same as, Hirsch's h-index for citations, which cannot be directly applied to SSRN downloads. We analyze data for about 30,000 authors and 367,000 papers. We find a simple empirical formula for the SSRN author rank via a Gaussian function of the log of the number of downloads.

[1] Zura Kakushadze, Ph.D., is the President and a Co-Founder of Quantigic Solutions LLC and a Full Professor in the Business School and the School of Physics at Free University of Tbilisi. Email: zura@quantigic.com

[2] DISCLAIMER: This address is used by the corresponding author for no purpose other than to indicate his professional affiliation, as is customary in publications. In particular, the contents of this paper are not intended as investment, legal, tax or any other such advice, and in no way represent the views of Quantigic Solutions LLC, the website or any of their other affiliates.

1. Introduction

In many scientific disciplines (e.g., physics) the total number of a researcher's citations is considered an important metric of the researcher's scientific impact. However, it does not take into account the author's productivity: an author may write just a single paper garnering many citations. Complementarily, the number of publications is also an important metric. However, it does not account for how important any of the researcher's (possibly numerous) papers are.

Hirsch (2005) proposed an index, the h-index [3], whose purpose is to combine into a single number both the citations and publications figures. The appeal of the h-index is that it is intuitive and simple to compute, and that it requires only data for a given author (publications and citations) and no cross-sectional data (across a sample of authors). E.g., INSPIRE High-Energy Physics Literature Database, Thomson Reuters Web of Science and Google Scholar all have utilized the h-index. Also, a variation of it has been applied to internet media (Hovden, 2013).

Social Science Research Network (SSRN) keeps track of numbers of downloads for each author and paper. The number of SSRN downloads has, perhaps not surprisingly in hindsight, become an important metric in its own right. SSRN ranks authors and papers by the number of downloads. However, just as with citations, the number of downloads for a given author does not take into account the author's productivity, that is, the number of publications (or papers).

In this note we propose a new index for SSRN downloads. Unlike the SSRN downloads rank, which is based on the total number of an author's SSRN downloads, our index also reflects the author's productivity by taking into account the download numbers for the papers. Our index is inspired by, but is not the same as, the h-index.
Thus, if we apply the h-index to SSRN downloads (via simply replacing citations by downloads), we will find that the h-index is mostly equal to the number of papers: the bulk of the numbers of downloads exceeds the bulk of the numbers of citations by roughly a few orders of magnitude, so the h-index is uninformative. [4] We circumvent this difficulty by noting that the numbers of downloads (in fact, just as the numbers of citations) have quasi-log-normal distributions. I.e., the numbers of downloads (citations) are exponential by nature.

[3] A scientist has index h if h of his/her papers have at least h citations each, and the other papers have no more than h citations each (Hirsch, 2005). This is the same as the Eddington number in cycling, with papers replaced by days and citations replaced by miles cycled on a given day. Its inventor, Sir Arthur Stanley Eddington, was a renowned British astronomer, physicist and mathematician and an avid cyclist.

[4] The h-index for citations turns out to be informative due to a numerological accident: the number of papers is typically bounded by a (low) two-digit number, and the number of citations is bounded by roughly the square of that. The same does not hold for downloads, which are roughly a few orders of magnitude more numerous. We could rescale the numbers of downloads by an overall factor based on cross-sectional data, thereby forgoing simplicity.
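The point that the h-index mostly equals the number of papers when applied to downloads can be illustrated with a minimal Python sketch. The function name `h_index` and both count vectors are hypothetical, chosen only for illustration; they are not taken from the SSRN data.

```python
def h_index(counts):
    """Largest h such that h items have at least h counts each."""
    ranked = sorted(counts, reverse=True)
    # counts are non-increasing and positions increasing, so the
    # condition c >= i holds exactly for the first h positions
    return sum(1 for i, c in enumerate(ranked, start=1) if c >= i)


# Hypothetical author with six papers.
citations = [120, 45, 30, 8, 3, 1]         # citation-like magnitudes
downloads = [5000, 1200, 400, 60, 25, 12]  # download-like magnitudes

print(h_index(citations))  # 4
print(h_index(downloads))  # 6, i.e., simply the number of papers
```

With download-sized counts every paper clears the threshold, so the index carries no information beyond the paper count.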

We therefore define the following index (call it $k$ for the lack of a better name) for SSRN downloads for a given author:

$$k = \max_i \min(X_i, i) \tag{1}$$

$$X_i = \lfloor \ln(D_i) \rfloor \tag{2}$$

where $D_i$ is the number of downloads for the $i$-th paper by said author; $i = 1, \dots, N$; $N$ is the number of papers with $D_i > 0$; and the $D_i$ are ordered decreasingly: $D_1 \geq D_2 \geq \dots \geq D_N$. As we discuss in more detail below, the index $k$ appears to produce reasonable statistical output.

Let us recast (1) and (2) in plain English. For a given author, take all papers with nonzero downloads. Sort these papers in decreasing order of the number of downloads. For each paper, take the integer part (the floor) of the natural log of the number of downloads, which is $X_i$. Let us call $X_i$ log-downloads. Then the author's index equals $k$ if $k$ of the papers have at least $k$ log-downloads each, and the other $(N - k)$ papers have no more than $k$ log-downloads each. This is the same as the h-index with the numbers of citations replaced by log-downloads.

The index $k$ is integer-valued. Currently, it varies from 0 to 9 (we discuss our data in detail in the next section). So, many authors share the same $k$. We can add granularity via noninteger indexes $k^*$ and $D^*$ (cf. (Ruane and Tol, 2008)):

$$k^* = \ln(D^*) = \frac{(k+1)\ln(D_k) - k\,Y}{1 + \ln(D_k) - Y} \tag{3}$$

Here $Y = \ln(D_{k+1})$ if $k < N$; otherwise, $Y = k$. By definition, $k \leq k^* < k+1$ and $k = \lfloor k^* \rfloor$. Eq. (3) has a simple geometric meaning. First, consider the case $k < N$. Once we fix $k$ via Eqs. (1) and (2), we linearly interpolate between the points $(k, \ln(D_k))$ and $(k+1, \ln(D_{k+1}))$ (cf. (Rousseau, 2006, 2014)). On that line there always exists a point $(k^*, k^*)$, which defines $k^*$. For $k = N$ we interpolate between $(k, \ln(D_k))$ and $(k+1, k)$, with $k$ being a natural estimate for $\ln(D_{k+1})$.

We compute $k$, $k^*$ and $D^*$ for about 30,000 SSRN authors (see Section 2). Table 1 gives the top 20 authors by their $k^*$ values and shows that the ranking of authors by $k^*$ does not coincide with their SSRN rank (similarly to ranking by the h-index v. ranking by citations). Figure 1 illustrates the computation of the $k$, $k^*$ and $D^*$ indexes.
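The plain-English recipe for the integer index and the interpolation behind Eq. (3) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names and the symbols `k`, `y` and `log_dk` in the code are labels chosen here, the download vector is hypothetical, and returning 0.0 for the noninteger index when the integer index is 0 is a convention of this sketch rather than something the text specifies.

```python
import math


def k_index(downloads):
    """Integer index: max over i of min(floor(ln D_i), i), D_i sorted decreasingly."""
    d = sorted((x for x in downloads if x > 0), reverse=True)
    return max(
        (min(math.floor(math.log(x)), i) for i, x in enumerate(d, start=1)),
        default=0,
    )


def k_star(downloads):
    """Noninteger index: where the interpolating line crosses the diagonal."""
    d = sorted((x for x in downloads if x > 0), reverse=True)
    k = k_index(d)
    if k == 0:
        return 0.0  # convention of this sketch only
    log_dk = math.log(d[k - 1])
    # y is ln(D_{k+1}) if a (k+1)-th paper exists, otherwise k itself
    y = math.log(d[k]) if k < len(d) else float(k)
    return ((k + 1) * log_dk - k * y) / (1 + log_dk - y)


# Hypothetical download counts for one author.
sample = [5000, 1200, 400, 60, 10, 3]
print(k_index(sample))           # 4
print(round(k_star(sample), 3))  # 4.034
print(round(math.exp(k_star(sample)), 1))  # corresponding download-scale index
```

For this vector the floored log-downloads are 8, 7, 5, 4, 2, 1, so four papers have at least four log-downloads each, and the noninteger value falls between 4 and 5 as the interpolation requires.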
Both the $k^*$ and $D^*$ indexes are equally informative, and as a matter of preference we can use either interchangeably.

In Section 2 we discuss our data and its statistical characteristics for about 30,000 SSRN authors and over 367,000 papers. We find a simple empirical formula for the SSRN author rank $R$ (see Section 2 for the empirical values of the numeric coefficients $a$, $b$, $c$):

$$\ln(R) \approx a + b\,\ln(D) + c\,\ln^2(D) \tag{4}$$

where $D$ is the author's total number of downloads. In Section 3 we discuss some statistical properties of our indexes $k$, $k^*$ and $D^*$, including an analog of the ratio between the total number of citations and $h^2$ for the h-index. In Subsection 3.1 we discuss another index for SSRN downloads inspired by the g-index (Egghe, 2006) and how it compares to the index $k$. We briefly conclude in Section 4, where we discuss advantages of our proposal, some caveats and (in some cases) how to cope with them.

As mentioned above, our indexes for SSRN downloads are related to the h-index (and, thereby, the Eddington number) by virtue of their definitions. In this regard, let us mention some prior works. Here we will not attempt a comprehensive overview of the literature on the h-index and various related indexes; for detailed reviews and extensive lists of references, see, e.g., (Alonso et al, 2009), (Egghe, 2010) and (Norris and Oppenheim, 2010). Instead, here we focus on prior works with some potential relevance to the approach we follow in this paper.

Thus, various nonlinear generalizations of the h-index have been proposed, e.g., the h(2) index (Kosmulski, 2006) and its generalizations (Levitt and Thelwall, 2007; Deineko and Woeginger, 2009). The h(2) index was applied to article downloads as a metric for academic journals (Hua et al, 2009). Logarithms of citations have been considered in other contexts such as ranking (see, e.g., (Lundberg, 2007) and (Stringer, Sales-Pardo and Amaral, 2008)). However, to our knowledge, our log-based indexes, which stem from our observation of the exponential nature of SSRN downloads (and other metrics, including citations; see Subsection 3.5), are the first of their kind. Also, our application of indexes of this kind to SSRN downloads is novel, and the beauty of working with SSRN downloads data is that it is large and provides lots of statistics. Currently, SSRN uses vanilla downloads (and, secondarily, citations) to rank authors and papers.
Variations on the h-index theme include the aforementioned g-index of (Egghe, 2006) (which is analogous to the h(2) index with citations replaced by cumulative citations), the hg-index (Alonso et al, 2010), the $h_\alpha$-index (Van Eck and Waltman, 2008), the A-index (Jin, 2006; Rousseau, 2006) and its variation the m-index (Bornmann, Mutz and Daniel, 2008), the R-index and the AR-index (Jin et al, 2007) (also see (Jarvelin and Persson, 2008)), the citation-weighted h-index (Egghe and Rousseau, 2008), the contemporary h-index, the trend h-index and the normalized h-index (Sidiropoulos, Katsaros and Manolopoulos, 2007), the dynamic h-type index (Rousseau and Ye, 2008), the tapered h-index (Anderson, Hankin and Killworth, 2008), variants accounting for multiple authors (Schreiber, 2008; Batista et al, 2006; Bornmann and Daniel, 2007; Imperial and Rodriguez-Navarro, 2007; Egghe, 2008), and other variations of the h-index.

2. Data

In this section we describe our dataset. We downloaded all of our data directly from the SSRN website. The download was automated, except for some manual patches (see below).

5 SSRN provides the Top Authors data for the top 30,000 authors 5 based on downloads, both overall and for the last 12 months. We downloaded this data on 08/11/ The data contains links to the authors freely accessible individual webpages with their scholarly papers. Out of the 30,000 webpages 15 turned out to be bad, so our dataset contains 29,985 authors. The Top Authors data consists of 300 webpages, with 100 authors per page. Among other data, for each author these webpages contain the total number of downloads and the total number of papers (overall and for the last 12 months). Table 2 gives summaries for and as well as downloads-per-paper / and their logarithms. Figure 2 plots densities and histograms for ln( ) (overall and 12 mo). The distribution for overall ln( ) is quasi-normal, so the distribution for overall is quasi-log-normal, i.e., it is skewed with a long tail at the higher end. The distribution for 12-mo is even more skewed. It is evident that we should not work with the numbers of downloads but their logs. This conclusion is further supported by the densities (and histograms) for downloads-per-paper in Figure 3. Furthermore, the density for overall ln( / ) in Figure 3 is very close to a Gaussian. In Figure 4 we plot the same density together with a Gaussian curve from a least-squares fit. 7 Here we should remark that the Top Authors data includes all papers in and thereby in the / computation, even those that SSRN does not include in the computation of (such as an author s so-called other papers ). 8 Furthermore, there are regular papers with no downloads for good reasons, e.g., papers with abstracts only and without downloadable PDF files. In many cases this is due to the policies of the journals where such papers are published, which do not permit posting published papers on the internet, including SSRN. Keeping such papers in in the / computation artificially lowers the downloads-per-paper figures. 
However, these nuances do not affect our conclusion relating to quasi-log-normality. 5 SSRN provides the Top Authors data for all authors and also separately for the law, business and economics authors, with such data for the accounting and finance authors apparently forthcoming. We analyzed the data for all authors. It would be interesting to repeat our analysis for the above 5 disciplines (when they become available). 6 Accessing this data beyond the top 10 authors requires an SSRN account login. The downloaded webpages state that the data was last updated on 07/27/ We use the R function optim() to determine the three parameters in the fit (mean, standard deviation and maximum value); see Figure 4. 8 E.g., in the Top Authors data, P. Fernandez s (SSRN ID 12696) is based on 230 of his papers. His 3 other papers (in the SSRN terminology) do not contribute to the total number of downloads; however, in the Top Authors data =233 and this is the number used in computing / = / All figures are as of the date we downloaded the data (see above). Using =230 would appear to be better. 5

6 2.1. SSRN Rank v. Downloads The density plots in Figures 2 and 3 are rather convincing: the numbers of downloads are exponential by nature. Let be the number of authors in our data. Then the SSRN author rank is given by (where the rank is computed across all authors in the Top Authors data) = +1 rank( ) (5) We plot v. ln( ) and ln( ) v. ln( ) in Figure 5 (overall and 12 mo). Let us start with the overall downloads. Except for the top 3 outliers (M.C. Jensen, P. Fernandez and E.F. Fama; see Table 1), the lower-left curve in Figure 5 is almost parabolic. A quadratic curve fits the data very well indeed. The results for the fit using a polynomial regression are given in Table 3. So, we have the empirical formula (4) with the numeric coefficients 4.704, 2.009, Adding a cubic term does not improve the fit. The inflection point in the upper-left curve in Figure 5 occurs around The results for the quadratic fit for the 12-mo downloads is summarized in Table 4. The empirical formula (4) also holds in this case with the numeric coefficients 11.18, 0.492, A cubic term does not improve the fit. The inflection point in the upper-right curve in Figure 5 occurs around Top Papers Data SSRN provides the Top Papers data for the top 10,000 papers 10 based on downloads, both overall and for the last 12 months. We downloaded this data on 08/13/ The Top Papers data consists of 100 webpages, with 100 papers per page. Among other data, for each paper these webpages contain the number of downloads (overall and 12 mo). Table 5 gives summaries for and its log. Figure 6 plots densities and histograms for ln (overall and 12 mo). The results are qualitatively similar to those for ln( ); see Table 2 and Figure 2. Let be the number of papers in our data. Then the SSRN paper rank is given by (where the rank is computed across all papers in the Top Papers data) = +1 rank( ) (6) 9 There are more of what can be deemed as outliers in Figure 5, lower-right corner: P. Fernandez (139,290), M.C. 
Jensen (66,244), M.O. Jackson (41,580), M.T. Faber (37,553), A. Damodaran (34,523), C.R. Harvey (32,171), H.M. Mialon (32,072), E.F. Fama (32,060), C.R. Sunstein (31,681), B. Bartlett (31,583) and A.M. Francis (31,517), with the total number of downloads in the last 12 months given in the parentheses. 10 Unlike the Top Authors data, the Top Papers data does not appear to be available by (any) discipline. 11 Accessing this data beyond the top 10 papers requires an SSRN account login. The downloaded webpages state that the data was last updated on 08/09/
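The rank definition in Eq. (5) and the quadratic fit of the log-rank against the log-downloads can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names, the symbols in the comments, the tie-breaking by input order and the download totals are all assumptions made here, and the fit is a plain least-squares solve of the normal equations rather than whatever regression routine was actually used.

```python
import math


def author_ranks(downloads):
    """Rank 1 goes to the largest total: R = M + 1 - (ascending rank of D)."""
    m = len(downloads)
    order = sorted(range(m), key=lambda j: downloads[j])  # ties: input order
    ascending = [0] * m
    for position, j in enumerate(order, start=1):
        ascending[j] = position
    return [m + 1 - r for r in ascending]


def fit_quadratic(xs, ys):
    """Least-squares coefficients (c0, c1, c2) of y ~ c0 + c1*x + c2*x**2."""
    s = [sum(x ** p for x in xs) for p in range(5)]
    v = [sum(y * x ** p for x, y in zip(xs, ys)) for p in range(3)]
    # augmented 3x4 matrix of the normal equations
    a = [[s[r + c] for c in range(3)] + [v[r]] for r in range(3)]
    for col in range(3):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [p - f * q for p, q in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


# Hypothetical total downloads for a handful of authors.
totals = [830000, 150000, 42000, 9000, 3100, 1200, 650, 400]
ranks = author_ranks(totals)
coeffs = fit_quadratic([math.log(d) for d in totals],
                       [math.log(r) for r in ranks])
print(ranks)
print(coeffs)  # intercept, linear and quadratic coefficients of the fit
```

With real data one would fit all authors at once and compare the quadratic coefficient against the tables cited in the text; the toy totals above are far too few for the fitted values to mean anything.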

7 We plot v. ln and ln( ) v. ln in Figure 7 (overall and 12 mo). Let us start with the overall downloads. As above, except for several top outliers, 12 the lower-left curve in Figure 7 is almost parabolic. A quadratic curve fits the data very well indeed. The results for the fit using a polynomial regression are given in Table 6. So, we have the empirical formula (4) (where is replaced by and is replaced by ) with the numeric coefficients 4.375, 1.952, The inflection point in the upper-left curve in Figure 7 occurs around 650. The results for the quadratic fit for the 12-months downloads are summarized in Table 7, so we have 16.78, 1.331, in the formula (4). 14 However, the curvature is negligible, so we can use a linear fit instead by setting = 0 in Eq. (4), for which the results are provided in Table 8, and we have ln + ln with 17.95, Data from SSRN Author Pages As mentioned above, the Top Authors data contains links to the 30,000 authors individual webpages. We downloaded these webpages in an automated fashion on 08/16/2015 and 08/17/ The data is essentially structured, with a few caveats. E.g., the Posted: date is not always shown, which complicates parsing. Also, the default ordering of the papers is by the decreasing number of downloads; however, occasionally this ordering is not followed with no clear pattern. Furthermore, papers that have been revised and are still under review by SSRN are moved to the bottom of the list with no Last Revised: date. However, each webpage has a field showing the total number of downloads, so a simple sanity check is that summing the number of paper downloads over all papers with non-zero/non-empty downloads fields should produce the total number of downloads. Out of 29,985 good webpages (see above), all but 256 satisfied this criterion with straightforward parsing. An additional heuristic further reduced this number to 78. We manually checked and patched the data for these remaining 78 pages on 08/18/2015. 
However, the laborious and time-consuming sourcing resulted in high quality data. 12 The following papers are the apparent outliers (the format is (authors(s), year), ): (Faber, 2007), 152,242; (Solove, 2007), 150,263; (Jensen and Meckling, 1976), 108,410; (Jackson, 2011), 96,296; (Fama, 1998), 83,174; (Girgis, George and Anderson, 2010), 73,622. We include these papers in the References so the reader can get a flavor on the spectrum of the topics and authors of the top downloaded papers (overall). 13 The coefficients and in this case are close to those for the overall total downloads (see above and Table 3). 14 The following papers are the apparent outliers (the format is (authors(s), year), ): (Francis and Mialon, 2014), ; (Jackson, 2011), 31476; (Bartlett, 2015), 30538; (Faber, 2007), 21531; (Fama and French, 2015), 12081; (Roche, 2011), Again, we include these papers in the References so the reader can get a flavor on the spectrum of the topics and authors of the top downloaded papers (12 mo). 15 The downloads take some time, even for a fast machine with a fast internet connection, which is what we used. We mention this to emphasize that the data is not 100% synchronized as, e.g., SSRN updates download counts in real-time. This asynchronicity is unavoidable with downloads; however, its effect on our analysis here is small. 7

8 For the reasons mentioned in Section 2, we drop all papers labeled as other papers (SSRN does not include downloads for such papers in the total download count), and also all papers with empty downloads fields. For each author we then have a vector, =1,, with =, same as the total number of downloads on the author s webpage. However, can be less than the total number of papers on the webpage as we omit the papers with empty downloads fields. Using this data we compute, and via Eqs. (1),(2) and (3). 3. Index Properties The number of papers (as defined above) across all authors in our database is 367,478. However, by definition, only a fraction of these papers contribute to the index : the number of such papers is simply a sum (across all authors) over the values of the integer index and turns out to be 112,793, or about 30.7%. Table 9 gives cross-sectional (across all authors in the SSRN Top Authors data) summaries for and the ratio = / (with NAs omitted). The =1 cases are rather ubiquitous, to wit, 8,572. However, these are mostly the authors with low paper counts. We have the following statistics for the number of occurrences of = according to the value: 3,326 for =1; 2,338 for =2; 1,842 for =3; 848 for =4; 205 for =5; 12 for =6; 0 for =7; 1 for =8; and 0 for =9 (see the histogram in Figure 8). The outlier = =8 corresponds to the author G. Feiger (SSRN ID ), whose =(47043,7971,7205,6438,6004,5607,5397,5276,3145). The index means that = papers have at least downloads each, i.e., we have and we can define the ratio = (7) We have 1. Summaries of and ln( ) are given in Table 9. The large values of are mostly due to the authors with low paper counts but substantial numbers of downloads. In Figure 8 we plot histograms for ln( ), and the quantity =ln( / )/ also summarized in Table 9. The significance of the quantity is that, if all (see Eq. (2)), then = and ln( ). 
The reason why the index works numerologically, that is, produces reasonable results, is that the bulk of the values of are of order 1. Had these values been much higher, most values of would equal and this index would be uninformative. The analog of the quantity for the h-index is / and this quantity is mostly large for SSRN downloads, which is precisely why the h-index does not work numerologically for SSRN downloads. In contrast, for citations the bulk of the values of / (here is the total number of citations) is around 3-5 for physics papers (Hirsch, 2005) focuses on, which is the reason the h-index works reasonably well numerologically for citations in that particular field. 8

9 3.1. An Alternative Index Suppose an author has index. This index knows nothing about the detailed structure of the downloads for the papers with =, only that =exp( ) (we are assuming that the papers are ordered with decreasing ). To give more weight to the papers with more downloads, we can consider an alternative index call it for the lack of a better name for SSRN downloads for a given author: =maxmin(, ) (8) = ( ) (9) = 1 (10) I.e., the integer-valued index is based on the average number of downloads for the first papers (as opposed to the number of downloads for the -th paper). 16 As in Section 1, we can define a non-integer index via (see Figure 9) =ln( )= ( +1)ln( ) 1+ln( ) (11) Here =ln( ) if < ; otherwise, =. By construction, < +1 and =. The analog of in Eq. (7) is = / ; however, the analog of is trivial: for = we have =1 as = / in this case. By construction, the number of papers that contribute to the index is higher (compared with the index ); it is simply a sum (across all authors) over the values of the integer index and turns out to be 128,494, or about 35.0%. Table 10 gives cross-sectional (across all authors in the SSRN Top Authors data) summaries for, ln( ), and the ratio = / (with NAs omitted). As above, the =1 cases are rather ubiquitous, to wit, 11,343, and mostly correspond to the authors with low paper counts. We have the following statistics for the number of occurrences of = according to the value (currently, the maximum value of is 10): 3,326 for =1; 2,395 for =2; 2,206 for =3; 1,868 for =4; 1,141 for =5; 356 for =6; 42 for =7; 8 for =8; 1 for =9; and 0 for =10 (see the histogram in Figure 10). The outlier = =9 corresponds to the author S. Zafron (SSRN ID ), whose 9 papers have the downloads vector =(21609,21148,17194,11034,8930,1308,1134,256,231). Table 11 lists the top 20 authors by the index (cf. Table 1 for the top 20 authors by the index ). 16 This is analogous to the -index (Egghe, 2006) for citations. 
Here we have (the integer part of) the log in Eq. (9), which makes all the difference. Just as the h-index, the -index is uninformative when applied to SSRN downloads. 9

3.2. Are Our Indexes Informative?

One of the critiques of the h-index was set forth in (Yong, 2014). In a nutshell, it boils down to the fact that we can think of $C$ citations being partitioned into $N$ papers, and then the h-index is the side-length of the so-called Durfee square, which is the largest $h \times h$ square that fits into the so-called Young diagram for said partition (see, e.g., (Anderson, Hankin and Killworth, 2008)). For given $C$ and $N$ there is a finite range of values the h-index can take. Assuming equal probabilities (i.e., no additional information) we can define the expected value of the h-index. When $C$ is large, there is an asymptotic formula for this expected value (Canfield, Corteel and Savage, 1998), which Yong (2014) proposes to use as the rule-of-thumb estimate for the h-index:

$$h \approx \frac{\sqrt{6}\,\ln(2)}{\pi}\,\sqrt{C} \approx 0.54\,\sqrt{C}$$

Yong then argues that the information in the h-index beyond what is already in $C$ is limited. In this regard, here we also should ask whether our index $k$ and the alternative index of Subsection 3.1 are informative (beyond what is encoded in the total number of downloads $D$). Since our indexes are based on logs of the numbers of downloads, the combinatorial tricks do not appear to be directly applicable. We will therefore take an empirical approach.

We plot both integer indexes as well as their noninteger counterparts v. $\ln(D)$ in Figure 11. There is an apparent linear $\ln(D)$ component in these indexes. Linear regressions of the indexes over $\ln(D)$ with the intercept are summarized in the accompanying tables. Adding a second explanatory variable improves the fits; see Table 16. Evidently, the dependence on $\ln(D)$ is not the end of the story: there is more information encoded in these indexes beyond what is already in $\ln(D)$.

3.3. Twelve-months Indexes

One shortcoming of the h-index is that, for a given author, it cannot decrease. A retired author can have a high h-value without writing a single new paper. By definition, the same applies to our indexes.
In the case of the h-index, one can implement a weighting scheme, whereby older papers are given less weight (Sidiropoulos, Katsaros and Manolopoulos, 2007). The same idea can be applied to our indexes. However, the data for the age of the downloads is not readily available, at least not publically, so any empirical analysis presently is out of reach. Nonetheless, not all is lost. We can compute our indexes for the last 12-months downloads. The author webpages do not separately provide the last 12-mo download data. We circumvent this difficulty by utilizing the Top Papers data, which contains the top 10,000 most downloaded papers for the last 12 months together with their last 12-mo download numbers. 17 For each 17 It also contains data for the overall downloads. However, the Top Papers data is ordered by the rank based on the last 12-mo downloads. This implies that this data may not contain the overall top 10,000 most downloaded papers. The same applies to the Top Authors data, which is also ordered by the rank based on the last 12-mo downloads. However, it is not unreasonable to focus on the papers that have been downloaded more recently. 10

11 author contributing to these 10, papers we can extract the author s papers with the last 12-mo download numbers and therefore compute our indexes. Summaries for the 12-mo and index values are given in Table 17. Tables 18 and 19 give the top 20 authors by the 12- mo and index values, respectively. The numbers of papers (column 6) in Tables 18 and 19 are lower than those in Table 1. It is not surprising that some papers do not make it to the top 10,000 most downloaded papers. This causes the total numbers of downloads in the last 12 months in Tables 18 and 19 to be lower than those reported in the Top Authors data (not shown), so the 12-mo rank in Tables 18 and 19 (column 7), which is based on the total numbers of downloads in Tables 18 and 19 (column 5), is not always the same as the 12-mo rank in the Top Authors data. However, we expect that the omitted low-downloads papers hardly affect the index values. The rank is not critical here and is shown solely for comparison purposes. We can analyze the time-dependence of our indexes in more detail using the SSRN author webpage data. It contains the Posted: field (with some missing cases see above), the date a paper was originally posted on SSRN. We parsed the 30,000 author webpages (see above) and for each author identified the earliest of the Posted: dates, which we use to measure, the authors SSRN career lengths in years (for simplicity we set 1 mo = 1/12 yr, 1 day = 1/30 mo). Plots of ln( ), ln( / ), and v. ln( ) are given in Figure There is no statistically significant relation between ln( / ) and ln( ); see Table On average, there is linear growth in ln( ), and (at the upper end of the values) with ln( ), which is expected; see Table 20. Adding ln( ) as a third explanatory variable in the regressions in Table 16, however, has a negligible effect on the fits. The ln( ) dependence is not the main driver Why Do Our Indexes Work Numerologically? In our definition of the index in Eqs. 
(1) and (2) (and, consequently, in our definitions of the indexes as well as and ) we chose the natural logarithm ln( ) as opposed to a logarithm log ( ) with another base. How come? The answer is rather prosaic. We chose the natural logarithm because it works well numerologically. Let us elaborate on this point. 18 One paper does not list the author(s) in Top Papers, so we end up with 9,999 papers with 11,871 authors. 19 We downloaded the 30,000 SSRN author webpages to extract the Posted: fields on August 28-29, This does not affect the actual values of ; however; there were more bad webpages, 27 instead of 15 (see above). 20 So that the statistic is meaningful, we take 1. After dropping NAs (see above), we have 28,552 datapoints. The summary for reads: Min = 1.003, 1st Quartile = 4.111; Median = 7.467; Mean = 8.415; 3rd Quartile = ; Max = The maximum corresponds to the author J. Pontiff (SSRN ID 17153), with a posting on 5/9/ Cf. the so-called -quotient (the h-index over the number of years); see (Hirsch, 2005). 11

Suppose we have $N$ objects (e.g., papers), each of which is characterized by a count of sorts (e.g., a number of citations or downloads, etc.). Let us call these counts $D_i$, $i = 1, \dots, N$. Let us further assume that the counts are exponential by nature (just as is the case with downloads), i.e., the cross-sectional distribution of the total counts $D = \sum_{i=1}^{N} D_i$ across the object owners (e.g., authors) is (quasi-)log-normal. Suppose we wish to construct an index along the lines of our index $k$. We can define this index via Eq. (1) with $X_i$ defined more generally as follows: [22]

$$X_i = \lfloor \gamma\,\ln(D_i) \rfloor \tag{12}$$

Here $\gamma$ is an overall normalization factor, which for SSRN downloads in Eq. (2) we have set to 1 for the reasons we will explain momentarily. More generally, $\gamma$ need not be 1. The choice of the base $b$ in the logarithm is then subsumed in $\gamma$ as $\log_b(D_i) = \ln(D_i) / \ln(b)$. I.e., the choice of the base of the logarithm is equivalent to the choice of the overall normalization factor. In the context of the h-index this is analogous to the $h_\alpha$-index of (Van Eck and Waltman, 2008); also see (Waltman and Van Eck, 2009). Our $\gamma$ in Eq. (12) is analogous to $\alpha$ in the $h_\alpha$-index.

[22] Let us note that rescaling $D_i$ by some factor would merely shift the range of values of the index.

So, what should we choose as our factor $\gamma$? There is no magic prescription here. There are two evident guiding principles: i) the resulting index $k$ (and also $k^*$ and all the other related indexes) should be informative, and ii) simplicity. E.g., if we choose $\gamma$ too high, the index $k$ will mostly equal the number of objects $N$ and thereby be uninformative. If we choose $\gamma$ too low, then the index $k$ will mostly equal 0 or 1 and thereby also be uninformative. A choice of $\gamma$ that avoids such extremes is one for which the bulk of the values of the product $\gamma x$ is of order 1, where, as above, $x = \ln(D/N)/N$. E.g., we can set $\gamma = 1/\mathrm{median}(x)$, albeit this is not the only choice. For the overall SSRN downloads the bulk of the values of $x$ is of order 1 (see Table 9), which is why we have chosen $\gamma = 1$ in Eq. (2), or, equivalently, the natural logarithm and not any other base. As mentioned above, this is the numerological reason why our indexes work well for the overall SSRN downloads. Also, while, e.g., $\mathrm{median}(x) \approx 0.7$ (see Table 9), we have chosen $\gamma = 1$ based on a further consideration of simplicity, so that only each author's data is required to compute his/her indexes (and no cross-sectional data across a sample of authors).

Here the following remark is in order. Basing $\gamma$ on the quantity $x = \ln(D/N)/N$ makes sense only if $\ln(D/N)$ is essentially normally distributed and the data is not inundated with $N = 1$ (and, more generally, low $N$) datapoints. Our dataset based on the Top Authors data satisfies these criteria: as we discussed above, the density of $\ln(D/N)$ for the overall downloads is almost Gaussian (see Figure 4); among the 29,979 datapoints (there are 29,985 good webpages (see above), 6 of which contain no papers) there are only 3,326 authors with $N = 1$ and 2,399 authors with $N = 2$; and the paper count statistics are reasonable (see Table 2). If $\ln(D/N)$ itself has a highly skewed distribution or the data mostly contains low $N$ points, then blindly relying on $x = \ln(D/N)/N$ would produce nonsensical results. E.g., $\mathrm{median}(x) \approx 5.3$ for the 9,999 papers discussed in Subsection 3.3 based on the Top Papers data, despite the fact that the bulk of the 12-mo download numbers are roughly 5-7 times less numerous than the bulk of the overall download numbers. This is due to the fact that the majority of the 11,871 authors of these 9,999 papers have $N = 1$ and $N = 2$ (see Table 17). If we remove the $N = 1$ and $N = 2$ datapoints, happily we are left with only 1,310 authors with the bulk of the values of $x$ of order 1 (to wit, $\mathrm{median}(x) \approx 1.4$).

3.5. Can We Apply Our Indexes to Citations?

As mentioned above, even citations are essentially exponential by nature. In this regard, we believe it would make sense to apply our ideas here to citations as well. This topic, interesting in its own right, is outside of the scope of this paper, so we will not delve into it too deeply and only give a bird's-eye view. A detailed empirical analysis would be required to see if it works.

If we apply, say, the index $k$ as defined in Eqs. (1) and (2) directly to citations, it may not work as well numerologically. This is because the numbers of citations are lower than the numbers of SSRN downloads. Thus, as of 9/4/2015, M.C. Jensen has the most SSRN downloads, to wit, 830,936, while according to INSPIRE (see above) E. Witten's (high energy physics) total number of citations is 118,374 with a total of 332 citable papers. [23] As of 9/4/2015, (Faber, 2007) is the most downloaded SSRN paper with 158,804 downloads, while the most cited paper in high energy physics (per INSPIRE) is (Maldacena, 1997) with 10,996 citations.
As is customary in high energy physics, we exclude (Particle Data Group Collaboration, 2014) with 50,004 citations as of the end of 2014, which is a handbook of elementary particles and traditionally garners the most citations in high energy physics year after year. Based on the above numbers, we can only superficially estimate the order-of-magnitude difference between SSRN downloads and citations; a more detailed analysis (which is outside of the scope of this paper) would be required to get more precise bulk numbers. In any event, if we apply our indexes with a normalization factor of 1 to citations, we can expect that they will produce reasonable results at the higher end (i.e., for highly cited authors and papers), and a nontrivial normalization factor might be required to have informative indexes for lower citation count tranches.

Our log-based indexes should be applicable beyond SSRN downloads and citations, e.g., for internet media downloads, which apparently are also exponential in nature (cf. (Hovden, 2013)). However, depending on the type of downloads and the bulk values thereof, the normalization factor in Eq. (12) may have to be chosen away from 1 for our indexes to work well numerologically.

23 Using Harzing's Publish or Perish software (version ) with Google Scholar as the data source gives inflated figures: 161,969 citations and 574 papers. In our experience, INSPIRE does undercount citations, though.
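As a rough check of the comparison above, the following Python sketch computes the gap between the quoted top SSRN download figures and the quoted top INSPIRE citation figures, both in base-10 orders of magnitude and as a shift in natural logs (the scale on which a log-based index operates). The variable and function names are ours, and only the four numbers quoted in the text are used, so this says nothing about the bulk of either distribution.

```python
import math

# Figures quoted in the text (as of 9/4/2015).
ssrn_author_max = 830_936      # most downloaded SSRN author (M.C. Jensen)
inspire_author_max = 118_374   # E. Witten's citations per INSPIRE
ssrn_paper_max = 158_804       # most downloaded SSRN paper (Faber, 2007)
inspire_paper_max = 10_996     # most cited HEP paper (Maldacena, 1997)


def orders_of_magnitude(a, b):
    """Base-10 log of the ratio a / b."""
    return math.log10(a / b)


def log_shift(a, b):
    """Natural-log difference ln(a) - ln(b)."""
    return math.log(a) - math.log(b)


author_gap = orders_of_magnitude(ssrn_author_max, inspire_author_max)  # about 0.85
paper_gap = orders_of_magnitude(ssrn_paper_max, inspire_paper_max)     # about 1.16
author_shift = log_shift(ssrn_author_max, inspire_author_max)          # about 1.95
paper_shift = log_shift(ssrn_paper_max, inspire_paper_max)             # about 2.67
```

At the very top of the two distributions the natural-log counts thus differ by roughly 2 to 3, which is comparable in size to the index values themselves and suggests why a factor other than 1 may be needed for lower citation counts.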

4. Conclusions

Let us start by tying up a loose end, so to speak. In the empirical regression in Table 6 we used the Top Papers data, which only contains 10,000 datapoints. We did so for illustrative purposes, as downloading this data, which amounts to downloading only 100 webpages, is much less arduous than downloading 30,000 individual author webpages. However, the latter already contain the data (much more of it, 367,478 papers) required in the regression in Table 6. We give the results for this regression based on the author webpage data in Table 21, which are qualitatively similar to the results in Table 6, and we still have the empirical formula (4). 24 As mentioned in Table 17, the Top Papers data sample is actually small, despite the 9,999 papers it contains, so any results obtained using the Top Papers data should be taken with a grain of salt.

Let us now discuss possible variations and generalizations of our indexes, a topic that invariably overlaps with caveats. E.g., as in the case of the h-index, typical values of our indexes naturally will vary from discipline to discipline. One way to deal with this is to simply compute the indexes separately for each discipline. As mentioned above, SSRN provides the Top Authors (but not the Top Papers) data broken down by some disciplines (law, business and economics), with such data for other disciplines (to wit, accounting and finance) apparently forthcoming. It would be interesting to statistically analyze our indexes by each discipline. 25 One way the disparity in the h-index across different disciplines has been dealt with is via normalizing the h-index (or the citations) within each discipline (e.g., via dividing by a mean for the discipline). 26 A similar approach can be applied to our indexes as well.

So, can we simply rescale SSRN downloads by some factor and apply the h-index to the rescaled downloads (as, e.g., for the internet media in (Hovden, 2013))?
A natural choice for such a factor is, e.g., a cross-sectional median based on the Top Authors data (the distribution is too skewed to use the mean, which is much higher). However, if we apply the h-index to the downloads divided by 30, expectedly, we get an index with a highly skewed distribution. There is no escaping the fact that SSRN downloads are exponential by nature, as we hopefully convincingly argued above based on the empirical data. Logs of the numbers of downloads, not the numbers of downloads, are a natural measure. And this is essentially our key observation. Once we accept this fact, the indexes we propose are natural, with possible tweaks (e.g., the linear extrapolation between the two adjacent points used in Eq. (3) can be tweaked, but this is all minutiae).

Just as the h-index, our indexes use only each author's data, but no cross-sectional (across a sample of authors) data, which makes them easy to compute. Rescaling by some factor that requires cross-sectional data to compute would complicate the calculation of an index. And, once again, such rescaling does nothing to address the exponential nature of SSRN downloads, while our indexes are built on that very premise.

24 With the coefficients replaced by the estimates in Table 21. The inflection point in the fitted curve occurs at around 70. Let us note that the author webpage data contains the numbers of overall downloads for each paper by each of the 30,000 authors, but not for the last 12 months. So, we have no choice but to use the Top Papers data for the regressions in Tables 7 and 8.

25 We have refrained from doing so here as the breakdown by all disciplines is currently unavailable. Let us note that the author webpage data can also be split by discipline as the SSRN IDs are obtained via the Top Authors data.

26 See, e.g., (Anauati, Galiani, and Gálvez, 2014), (Batista et al, 2006), (Bornmann and Daniel, 2008), (Iglesias and Pecharromán, 2007), (Kaur, Radicchi and Menczer, 2013), (Podlubny, 2005), (Podlubny and Kassayova, 2006).

Several variations of and indexes complementary to the h-index have been proposed, e.g., the g-index (Egghe, 2006), whose analog for SSRN downloads is among our indexes. A geometric mean of the h-index and the g-index, the so-called hg-index (Alonso et al., 2010), is a composite index (also, see, e.g., (Franceschini and Maisano, 2011)). An evident application to our indexes would be to consider geometric means of the corresponding pairs of our indexes.

Another point worth mentioning relates to the overall normalization factor we introduced in Subsection 3.4. In Eq. (12) it is implicitly assumed to be a constant. However, there is no reason (other than forgoing simplicity, that is) why we could not consider a non-constant normalization factor in Eq. (12) instead, i.e., some function that, in many cases, is likely relatively slowly varying.

Self-citations affect the total number of citations as well as the h-index. For citations this is relatively simple to deal with: one can simply remove self-citations. E.g., INSPIRE provides such functionality. Analogously, self-downloads can be a nuisance for SSRN downloads (see, e.g., (Edelman and Larkin, 2014)) and thereby our indexes. Dealing with self-downloads is harder, perhaps even internally at SSRN. This is a caveat.
Also, some papers are not available for download from SSRN (due to journal policies; see above; with exceptions often granted to established authors). Naturally, as with any index, there are caveats. Nonetheless, our indexes are novel and it would be interesting if SSRN could analyze and perhaps even implement them.

Acknowledgments

I would like to thank Ludo Waltman for reading a draft of the manuscript and valuable comments and suggestions that have helped improve it, and Blaise Cronin for encouragement.

References

Alonso, S.; Cabrerizo, F.; Herrera-Viedma, E.; Herrera, F. (2010) h-index: A review focused in its variants, computation and standardization for different scientific fields. Journal of Informetrics 3(4).

Alonso, S.; Cabrerizo, F.; Herrera-Viedma, E.; Herrera, F. (2010) hg-index: a new index to characterize the scientific output of researchers based on the h- and g-indices. Scientometrics 82(2).

Anderson, T.; Hankin, R.; Killworth, P. (2008) Beyond the Durfee square: Enhancing the h-index to score total publication output. Scientometrics 76(3).

Anauati, M.V.; Galiani, S.; Gálvez, R.H. (2014) Quantifying the Life Cycle of Scholarly Articles Across Fields of Economic Research. Available at SSRN.

Bartlett, B. (2015) How Fox News Changed American Media and Political Dynamics. Available at SSRN.

Batista, P.D.; Campiteli, M.G.; Kinouchi, O.; Martinez, A.S. (2006) Is it possible to compare researchers with different scientific interests? Scientometrics 68(1).

Bornmann, L.; Daniel, H. (2007) Convergent validation of peer review decisions using the h index: Extent of and reasons for type I and type II errors. Journal of Informetrics 1(3).

Bornmann, L.; Daniel, H. (2008) What do citation counts measure? A review of studies on citing behavior. Journal of Documentation 64(1).

Bornmann, L.; Mutz, R.; Daniel, H. (2008) Are there better indices for evaluation purposes than the h index? A comparison of nine different variants of the h index using data from biomedicine. Journal of the American Society for Information Science and Technology 59(5).

Canfield, E.R.; Corteel, S.; Savage, C.D. (1998) Durfee polynomials. The Electronic Journal of Combinatorics 5 (1998) #R32.

Edelman, B.G.; Larkin, I. (2014) Social Comparisons and Deception Across Workplace Hierarchies: Field and Experimental Evidence. Organization Science (Forthcoming). Available at SSRN.

Deineko, V.G.; Woeginger, G.J. (2009) A new family of scientific impact measures: The generalized Kosmulski-indices. Scientometrics 80(3).

Egghe, L. (2006) Theory and practice of the g-index. Scientometrics 69(1).

Egghe, L. (2008) Mathematical theory of the h- and g-index in case of fractional counting of authorship. Journal of the American Society for Information Science and Technology 59(10).

Egghe, L. (2010) The Hirsch index and related impact measures. Annual Review of Information Science and Technology 44(1).

Egghe, L.; Rousseau, R. (2008) An h-index weighted by citation impact. Information Processing & Management 44(2).

Faber, M.T. (2007) A Quantitative Approach to Tactical Asset Allocation. Journal of Wealth Management 9(4). Available at SSRN.

Fama, E.F. (1998) Market Efficiency, Long-Term Returns, and Behavioral Finance. Journal of Financial Economics 49(3). Available at SSRN.

Fama, E.F.; French, K.R. (2015) A Five-Factor Asset Pricing Model. Journal of Financial Economics 116(1). Available at SSRN.

Franceschini, F.; Maisano, D. (2011) Criticism on the hg-index. Scientometrics 86(2).

Francis, A.M.; Mialon, H.M. (2014) 'A Diamond is Forever' and Other Fairy Tales: The Relationship between Wedding Expenses and Marriage Duration. Available at SSRN.

Girgis, S.; George, R.; Anderson, R.T. (2010) What is Marriage? Harvard Journal of Law and Public Policy 34(1). Available at SSRN.

Hirsch, J.E. (2005) An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences 102(46).

Hovden, R. (2013) Bibliometrics for Internet media: Applying the h-index to YouTube. Journal of the American Society for Information Science and Technology 64(11).

Hua, P.-h.; Rousseau, R.; Sun X.-k.; Wan, J.-k. (2009) A Download h(2)-Index as a Meaningful Usage Indicator of Academic Journals. In: Larsen, B.; Leta, J. (eds.) ISSI 2009: 12th International Conference on Scientometrics and Informetrics. Rio de Janeiro: BIREME and Federal University of Rio de Janeiro.

Iglesias, J.E.; Pecharromán, C. (2007) Scaling the h-index for Different Scientific ISI Fields. Scientometrics 73(3).

Imperial, J.; Rodriguez-Navarro, A. (2007) Usefulness of Hirsch's h-index to evaluate scientific research in Spain. Scientometrics 71(2).

Jackson, M.O. (2011) A Brief Introduction to the Basics of Game Theory. Available at SSRN.

Jarvelin, K.; Persson, O. (2008) The DCI index: discounted cumulated impact-based research evaluation. Journal of the American Society for Information Science and Technology 59(9).

Jensen, M.C.; Meckling, W.H. (1976) Theory of the Firm: Managerial Behavior, Agency Costs and Ownership Structure. Journal of Financial Economics 3(4). Available at SSRN.

Jin, B. (2006) H-Index: An evaluation indicator proposed by scientist. Science Focus 1(1): 8-9.

Jin, B. (2007) The AR-index: Complementing the h-index. ISSI Newsletter 3(1): 6.

Jin, B.H.; Liang, L.L.; Rousseau, R.; Egghe, L. (2007) The R- and AR-indices: Complementing the h-index. Chinese Science Bulletin 52(6).

Kaur, J.; Radicchi, F.; Menczer, F. (2013) Universality of scholarly impact metrics. Journal of Informetrics 7(4).

Kosmulski, M. (2006) A new Hirsch-type index saves time and works equally well as the original h-index. International Society for Scientometrics and Informetrics Newsletter 2(3): 4-6.

Levitt, J.M.; Thelwall, M. (2007) Two new indicators derived from the h-index for comparing citation impact: Hirsch frequencies and the normalized Hirsch index. In: Torres-Salinas, D.; Moed, H.F. (eds.) Proceedings of the 11th Conference of the International Society for Scientometrics and Informetrics, Vol 2. Madrid, Spain: Spanish Research Council (CSIC).

Lundberg, J. (2007) Lifting the crown - citation z-score. Journal of Informetrics 1(2).

Maldacena, J.M. (1997) The Large N limit of superconformal field theories and supergravity. (Nov. 1997); Adv. Theor. Math. Phys. 1998, 2(2); Int. J. Theor. Phys. 1999, 38(4).

Norris, M.; Oppenheim, C. (2010) The h-index: a broad review of a new bibliometric indicator. Journal of Documentation 66(5).

Particle Data Group Collaboration (Olive, K.A. et al.) (2014) Review of Particle Physics. Chin. Phys. C38, 1676 pp.

Podlubny, I. (2005) Comparison of scientific impact expressed by the number of citations in different fields of science. Scientometrics 64(1).

Podlubny, I.; Kassayova, K. (2006) Law of the constant ratio. Towards a better list of citation superstars: compiling a multidisciplinary list of highly cited researchers. Research Evaluation 15(3).

Roche, C.O. (2011) Understanding the Modern Monetary System. Available at SSRN.

Rousseau, R. (2006) Simple models and the corresponding h- and g-index. Available at E-LIS.

Rousseau, R. (2014) A note on the interpolated or real-valued h-index with a generalization for fractional counting. Aslib Journal of Information Management 66(1).

Rousseau, R.; Ye, F.Y. (2008) A proposal for a dynamic h-type index. Journal of the American Society for Information Science and Technology 59(11).

Ruane, F.; Tol, R.S.J. (2008) Rational (successive) h-indices: An application to economics in the Republic of Ireland. Scientometrics 75(2).

Schreiber, M. (2008) A modification of the h-index: The hm-index accounts for multi-authored manuscripts. Journal of Informetrics 2(3).

Sidiropoulos, A.; Katsaros, C.; Manolopoulos, Y. (2007) Generalized h-index for Disclosing Latent Facts in Citation Networks. Scientometrics 72(2).

Solove, D.J. (2007) 'I've Got Nothing to Hide' and Other Misunderstandings of Privacy. San Diego Law Review 44(4). Available at SSRN.

Stringer, M.J.; Sales-Pardo, M.; Amaral, L.A.N. (2008) Effectiveness of Journal Ranking Schemes as a Tool for Locating Information. PLoS One 3(2): e1683.

Van Eck, N.J.; Waltman, L. (2008) Generalizing the h- and g-indices. Journal of Informetrics 2(4).

Waltman, L.; Van Eck, N.J. (2009) A simple alternative to the h-index. ISSI Newsletter 5(3).

Yong, A. (2014) Critique of Hirsch's Citation Index: A Combinatorial Fermi Problem. Notices of the American Mathematical Society 61(9).

Tables

Table 1. Top 20 SSRN authors by the index. The 6th column is the number of papers with at least 1 download. The 2nd column is rounded down to the 3rd decimal. The 4th column is rounded down to the nearest integer. The SSRN rank is based on the total number of downloads. All statistics are as of the date(s) of our downloads of the data (see Section 2).
Column headings: Author Name, SSRN ID; Total # of Downloads; # of >0 Papers; SSRN Rank.
Authors, in table order: Michael C. Jensen; Eugene F. Fama; Pablo Fernandez; Kenneth R. French; Attilio Meucci; Aswath Damodaran; William N. Goetzmann; Lucian A. Bebchuk; Shahin Shojai; Kostas Koufopoulos; Andrew W. Lo; Daniel Kaufmann; Nassim Nicholas Taleb; Werner Erhard; Christian Leuz; Stephen H. Penman; Bernard S. Black; Aart Kraay; Ignacio Velez-Pareja; Campbell R. Harvey.

Table 2. Cross-sectional (across all authors in the SSRN Top Authors data) summaries for the total number of downloads and its log, the total number of papers and its log, and downloads-per-paper and its log, both overall and for the last 12 months. In the number of papers for the last 12 months we drop all cases with 0 downloads. We give the numbers as rounded by R, so, e.g., the maximum numbers of downloads overall and in the last 12 months are not shown exactly. The downloads-per-paper figures are already rounded to the nearest integer in the SSRN Top Authors data. The bottom two rows summarize the ratio of the overall total number of downloads to the 12-mo total number of downloads and the log of this ratio.
Column headings: Quantity; Min.; 1st Quartile; Median; Mean; 3rd Quartile; Max.

Table 3. Summary (using the function summary(lm()) in R) for the cross-sectional (over all authors in the SSRN Top Authors data) polynomial regression of the log of the rank over the log of the total number of downloads and its square, with the intercept. The regression formula reads lm(y ~ x + I(x^2)) in R notations, where y is the log of the rank and x is the log of the total number of downloads. Here the rank and the total number of downloads are based on the overall downloads. We keep the outliers in the regression not to inflate the statistics.
Column headings: Estimate; Standard error; t-statistic; Overall statistics (Multiple/Adjusted R-squared, F-statistic). Rows: Intercept; log term; squared-log term.
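The regression described in Table 3, written there in R as lm(y ~ x + I(x^2)), can be sketched in plain Python as an ordinary least-squares fit of a quadratic. This is a minimal illustration under stated assumptions, not the computation behind the tables: the function name `fit_quadratic` is ours, the download counts below are hypothetical, and no standard errors or t-statistics are computed.

```python
import math


def fit_quadratic(xs, ys):
    """Ordinary least squares for y = a + b*x + c*x^2 via the normal equations."""
    # Power sums sum(x^k) for k = 0..4 and moments sum(y * x^k) for k = 0..2.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented 3x4 matrix of the normal equations.
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


# Hypothetical total download counts, already sorted so that rank 1 is the largest.
downloads = [50_000, 12_000, 4_000, 900, 300, 80, 20]
x = [math.log(d) for d in downloads]                      # x = ln(downloads)
y = [math.log(rank) for rank in range(1, len(downloads) + 1)]  # y = ln(rank)
intercept, slope, curvature = fit_quadratic(x, y)
```

Exponentiating the fitted quadratic gives the rank as a function of ln(downloads); when the curvature coefficient is negative, this is a Gaussian-shaped function of the log of the number of downloads, which is the form of empirical formula the paper refers to.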

Table 4. Same as Table 3 with the rank and the total number of downloads based on the last 12-months downloads.

Table 5. Cross-sectional (across all papers in the SSRN Top Papers data) summaries for the number of downloads and its log, both overall and for the last 12 months. We give the numbers as rounded by R. E.g., the maximum number of downloads in the last 12 months actually is 31862.
Column headings: Quantity; Min.; 1st Quartile; Median; Mean; 3rd Quartile; Max.

Table 6. Summary (using the function summary(lm()) in R) for the cross-sectional (over all papers in the SSRN Top Papers data) polynomial regression of the log of the rank over the log of the number of downloads and its square, with the intercept. The regression formula reads lm(y ~ x + I(x^2)) in R notations, where y is the log of the rank and x is the log of the number of downloads. Here the rank and the number of downloads are based on the overall downloads. We keep the outliers in the regression not to inflate the statistics.

Table 7. Same as Table 6 with the rank and the number of downloads based on the last 12-months downloads.

Table 8. Summary (using the function summary(lm()) in R) for the cross-sectional (over all papers in the SSRN Top Papers data) linear regression of the log of the rank over the log of the number of downloads, with the intercept. The regression formula reads lm(y ~ x) in R notations, where y is the log of the rank and x is the log of the number of downloads. Here the rank and the number of downloads are based on the last 12-months downloads. We keep the outliers in the regression not to inflate the statistics.

Table 9. Cross-sectional (across all authors in the SSRN Top Authors data) summaries for an index, a ratio built from it, the factor defined in Eq. (7), the log of this factor, and the ratio of the log of the downloads per paper to the number of papers. NAs are omitted.
Column headings: Quantity; Min.; 1st Quartile; Median; Mean; 3rd Quartile; Max.

Table 10. Same as Table 9 for the indexes of Subsection 3.1 (except that one of the quantities in Table 9 has no nontrivial analog in this case; see Subsection 3.1). The factor is defined in Subsection 3.1.

Table 11. Top 20 SSRN authors by the index. The 6th column is the number of papers with at least 1 download. The 2nd column is rounded down to the 3rd decimal. The 4th column is rounded down to the nearest integer. The SSRN rank is based on the total number of downloads. All statistics are as of the date(s) of our downloads of the data (see Section 2).
Column headings: Author Name, SSRN ID; Total # of Downloads; # of >0 Papers; SSRN Rank.
Authors, in table order: Michael C. Jensen; Eugene F. Fama; Pablo Fernandez; Daniel J. Solove; Kenneth R. French; William H. Meckling; Daniel Kaufmann; Aart Kraay; Massimo Mastruzzi; Nassim Nicholas Taleb; Andrew Metrick; Werner Erhard; John R. Lott; Matthew O. Jackson; Kostas Koufopoulos; Stephen H. Penman; William N. Goetzmann; Attilio Meucci; Aswath Damodaran; K. Geert Rouwenhorst.

Tables 12-15. Each table is a summary (using the function summary(lm()) in R) for the cross-sectional (over all authors in the SSRN Top Authors data) linear regression of one of the four indexes shown in Figure 11 over a logarithmic explanatory variable, with the intercept.

Table 16. Summaries (using the function summary(lm()) in R) for the cross-sectional (over all authors in the SSRN Top Authors data) linear regressions of each of the four indexes over two logarithmic explanatory variables, with the intercept. Cf. Tables 12-15.

Table 17. Cross-sectional (across all authors in the SSRN Top Papers data) summaries for the indexes based on the last 12-months downloads obtained using the Top Papers data. Since we only have 9,999 papers with 11,871 authors (see Subsection 3.3), i.e., the data sample is in fact small despite a large number of papers, the data mostly has authors with only 1 or 2 papers. This causes the bulk of the index values to be artificially low (and the primary cause of this is not the fact that the bulk of the 12-mo download numbers is lower than the bulk of the overall download numbers by roughly a factor of 5-7; see the bottom two rows in Table 2). Therefore, we provide summaries for all 11,871 authors, the 2,834 authors with more than 1 paper, and the 1,310 authors with more than 2 papers.
Column headings: Quantity; Min.; 1st Quartile; Median; Mean; 3rd Quartile; Max.

Table 18. Top 20 SSRN authors by the index based on the last 12-months downloads obtained using the Top Papers data. See Table 1 for number rounding and other information.
Column headings: Author Name, SSRN ID; Total # of Downloads (12 mo); # of Papers (12 mo); Downloads Rank (12 mo).
Authors, in table order: Pablo Fernandez; Michael C. Jensen; Aswath Damodaran; Attilio Meucci; Eugene F. Fama; Tobias J. Moskowitz; Campbell R. Harvey; Werner Erhard; Isabel Fernández Acín; Kenneth R. French; Lasse Heje Pedersen; Dan M. Kahan; Matthew O. Jackson; Cass R. Sunstein; George Serafeim; Clifford S. Asness; John R. Graham; Kari L. Granger; Guofu Zhou; Wade D. Pfau.

Table 19. Top 20 SSRN authors by the index based on the last 12-months downloads obtained using the Top Papers data. See Table 1 for number rounding and other information.
Column headings: Author Name, SSRN ID; Total # of Downloads (12 mo); # of Papers (12 mo); Downloads Rank (12 mo).
Authors, in table order: Pablo Fernandez; Matthew O. Jackson; Mebane T. Faber; Michael C. Jensen; Aswath Damodaran; Tobias J. Moskowitz; Eugene F. Fama; Clifford S. Asness; Kenneth R. French; Campbell R. Harvey; Isabel Fernández Acín; Pablo Linares; Lasse Heje Pedersen; Wade D. Pfau; Werner Erhard; Andrea Frazzini; Daniel J. Solove; Cass R. Sunstein; Attilio Meucci; Kari L. Granger.

Table 20. Summaries (using the function summary(lm()) in R) for the cross-sectional (over all authors in the SSRN Top Authors data with 1) linear regressions of four quantities (see Figure 12) over the log of the time, with the intercept. Here the time is measured in years from the date of the author's first posting of a paper ("Posted:" field) on SSRN until August 17, 2015.

Table 21. Same as Table 6, except all quantities are based on the author webpage data.

Figures

Figure 1. This figure illustrates the computation of our indexes for a randomly chosen author. The sloping diamonds correspond to the logs of the download counts. The horizontal circles correspond to the quantity defined in Eq. (2). The solid straight line has slope 1 and its intersection with the horizontal lines of circles gives an index value of 5. The dotted straight line with the negative slope goes through two adjacent diamond points and its intersection with the solid straight line determines an interpolated index value. In this example one of the resulting index values is 5.710.

Figure 2. Cross-sectional (across all authors in the SSRN Top Authors data) density for ln( ) and histogram for log( ), where the argument is the total number of downloads for each author. Left column: overall; right column: the last 12 months.

Figure 3. Cross-sectional (across all authors in the SSRN Top Authors data) density for ln( / ) and histogram for log( / ), where the numerator is the total number of downloads for each author, while the denominator is the number of papers. Left column: overall; right column: the last 12 months.

Figure 4. The solid line (not a Gaussian) is the same as the upper-left density curve in Figure 3 (with the mean and standard deviation based on the log of the downloads per paper, and the maximum value based on the density) for overall downloads. The diamonds correspond to the least-squares fit Gaussian curve (mean = 5.329, standard deviation = 0.961, maximum value = 0.409, all based on the fit).
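For reference, a Gaussian curve with a free amplitude can be written as below. The symbols A, μ, σ and x are notation assumed here, not taken from the paper. A unit-area Gaussian with σ = 0.961 would peak at 1/(σ√(2π)) ≈ 0.415 rather than at the quoted 0.409, so treating the maximum value as a separately fitted amplitude is our reading of the caption, not something the caption states.

```latex
f(x) = A \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right),
\qquad x = \ln(\text{downloads per paper}),
\qquad \mu = 5.329, \quad \sigma = 0.961, \quad A = 0.409 .
```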

Figure 5. Downloads rank and its log v. the log of the total number of downloads for each author. Left column: overall; right column: the last 12 months. See Subsection 2.1 for details.

Figure 6. Cross-sectional (across all papers in the SSRN Top Papers data) density for ln( ) and histogram for log( ), where the argument is the number of downloads for each paper. Left column: overall; right column: the last 12 months.

Figure 7. Downloads rank and its log v. the log of the number of downloads for each paper. Left column: overall; right column: the last 12 months. See Subsection 2.2 for details.

Figure 8. Upper-left: histogram of the index; upper-right: histogram of the number of papers with =1 (where = / ); lower-left: histogram of the log of the factor defined in Eq. (7); lower-right: histogram of the ratio of the log of the downloads per paper to the number of papers. See Section 3 for details.

Figure 9. This figure illustrates the computation of two further indexes for the same randomly chosen author as in Figure 1. The sloping diamonds correspond to the logs of the average numbers of downloads for the first papers (the papers are ordered decreasingly with the numbers of downloads). The horizontal circles correspond to the quantity defined in Eq. (9). The solid straight line has slope 1 and its intersection with the horizontal lines of circles gives an index value of 6. The dotted straight line with the negative slope goes through two adjacent diamond points and its intersection with the solid straight line determines an interpolated index value.

Figure 10. Upper-left: histogram of the index; upper-right: histogram of the number of papers with =1 (where = / ); lower-left: histogram of the log of the factor defined in Subsection 3.1; lower-right: density of the log of the downloads per paper, where the number of papers excludes all papers with empty fields (so the latter do not alter the density curve shape, cf. Figure 3). See Subsection 3.1 for details.

Figure 11. The four panels (upper-left, upper-right, lower-left, lower-right) each show one of the four indexes. Straight lines correspond to linear fits into the data (see Tables 12-15).

Figure 12. Upper-left: ln( ) v. the log of the time; upper-right: ln( / ) v. the log of the time; lower-left and lower-right: two of the indexes v. the log of the time. Here the time is measured in years from the date of the author's first posting of a paper ("Posted:" field) on SSRN until August 17, 2015. Also see Table 20.
