Empirical Bayes approaches to PageRank type algorithms for rating scientific journals

Size: px

Start display at page:

Download "Empirical Bayes approaches to PageRank type algorithms for rating scientific journals"

Julie Pierce
6 years ago
Views:

1 Empirical Bayes approaches to PageRank type algorithms for rating scientific journals Jean-Louis Foulley, Gilles Celeux, Julie Josse INRIA, École Polytechnique May 29, 2017 Julie Josse (Polytechnique) Empirical Bayes PageRank May 29, / 16

2 Overview 1 Motivations 2 A Bayesian Dirichlet-multinomial model without diagonal Model Estimating the hyperparameters in an empirical Bayes framework 3 Ranking statistical journals Julie Josse (Polytechnique) Empirical Bayes PageRank May 29, / 16

3 Motivation Julie Josse (Polytechnique) Empirical Bayes PageRank May 29, / 16

4 Assessing and ranking journals Impact Factor (Garfield, 1972): average number of annual citations received / published article in the last 2 years. Impacts scientific life. Highly criticized: + or - assessment of citations; depends on the field; citation window too narrow; self-citations influence; equal weight to each quotation whatever its origin, etc. Julie Josse (Polytechnique) Empirical Bayes PageRank May 29, / 16

5 Assessing and ranking journals Impact Factor (Garfield, 1972): average number of annual citations received / published article in the last 2 years. Impacts scientific life. Highly criticized: + or - assessment of citations; depends on the field; citation window too narrow; self-citations influence; equal weight to each quotation whatever its origin, etc. Alternatives: increase the citation window, standardize by field, etc... Take into account the importance of citing sources: Group lasso, stochastic bloc models, etc. (Varin et al., 2016) Google PageRank algorithms (Waltman et van Eck, 2010) : Prestige Scimago Journal Rank (PSJR) EigenFactor (EIFA): excludes self-citations Widely use: ease of computation. Lacks a probabilistic model. Julie Josse (Polytechnique) Empirical Bayes PageRank May 29, / 16

6 PageRank influence scores Table: C N N N cross-citation: citing (issuing ref) in rows and cited (receiving cit) in cols. c ij : number of times i, in a given year, quotes articles published by j over a previous period of time (5 years). AmS AISM AoS ANZS Bern 1 AmS AISM AoS ANZS Bern BioJ Adjacency matrix P p ij := ( c ij c i+ ): weighted oriented network citing cited. PageRank: random surfer of Google, teleportation matrix π = 1 N 1 N G = αp + (1 α)1 N π. (1) Discrete-time, irreducible & aperiodic Markov chain between the N nodes. The parameter α damping factor a Julie Josse (Polytechnique) Empirical Bayes PageRank May 29, / 16

7 PageRank influence scores PageRank (Brin and Page, 1998) r l+1 j = N i=1 g ij r l i, j = 1,..., N. (2) A quotation from JRRS-B does not have the same weight than others. Solution: eigenvector r with unit norm, r = r G. Julie Josse (Polytechnique) Empirical Bayes PageRank May 29, / 16

8 PageRank influence scores PageRank (Brin and Page, 1998) r l+1 j = N i=1 g ij r l i, j = 1,..., N. (2) A quotation from JRRS-B does not have the same weight than others. Solution: eigenvector r with unit norm, r = r G. EIFA Bergstrom (2007): self-citations excluded (c ii = 0, p ii = 0, i); ( π = ai a + ): freq of references published by each journal in 5 years. r eigenvector of G. r = P r and r = r. This is equivalent to r r = 1 N r (1 α)π. (3) α EIFA introduces the teleportation part and then attenuate its effect. Julie Josse (Polytechnique) Empirical Bayes PageRank May 29, / 16

9 A Bayesian Dirichlet-multinomial model without diagonal Model Empirical Bayes Julie Josse (Polytechnique) Empirical Bayes PageRank May 29, / 16

10 A Bayesian Dirichlet-multinomial model without diagonal Let C i := (c ij ): the row i of C without its diagonal elements. C (N N 1) 1 Multinomial sampling of the elements of C i C i θ i M(n i, θ i ) n i = j i c ij and θ i = (θ i1,..., θ i,i 1, θ i,i+1,..., θ i,n ). 2 Dirichlet prior distributions θ i γ \i D(γ \i ) where γ \i = (γ 1, γ 2,..., γ i 1, γ i+1,..., γ N ). Posterior distributions are Dirichlet: θ i γ \i, C i D(C i + γ\i ), i = 1,..., N. Julie Josse (Polytechnique) Empirical Bayes PageRank May 29, / 16

11 A Bayesian Dirichlet-multinomial model without diagonal Epost(θ ij ) = c ij + γ j j i c ij + j i γ, for j i, i = 1,..., N. j G i = α i p i + (1 α i )(π i ) Parametrization with prior expectation E(θ i ) = (π i ) : prior proba that i cites any other journals. πi = γ \i K with K \i \i = K γ i, K = N i=1 γ i. K \i, concentration: total number of fictive quotes given by i to others. Same form as EIFA but with shrinkage factor no longer fixed: α i = n i n i + K \i, 0 α i 1: large when n i large & K \i small (large prior variance). Julie Josse (Polytechnique) Empirical Bayes PageRank May 29, / 16

12 Empirical Bayes approach L(C γ) = Π N i=1l(c i γ \i ) L(C i γ \i ) = p(c i θ i )p(θ i γ \i )dθ i Multinomial component: p(c i θ i ) = n i! Π j i c ij! Π j iθ c ij ij Dirichlet component: p(θ i γ \i ) = Γ( j i γ j) Π j i Γ(γ j ) Π j iθ γ j 1 ij ( ) n i!γ L i (C j i γ j Γ(c i γ \i ) = ( ij + γ j ) )Π j i. Π j i c ij!γ j i (c ij + γ j ) Γ(γ j ) Clear distinction between structural and sampling zeroes. Multinomial part: no impact, it multiplies the likelihood by 1. Dirichlet part: impact, θ i of size (N 1) instead of N. Julie Josse (Polytechnique) Empirical Bayes PageRank May 29, / 16

13 Fixed point algorithm The log-likelihood can be written as: L i (γ \i ) = logγ(k \i ) logγ(n i + K \i ) + [logγ(c ij + γ j ) logγ(γ j )], and ( j i its gradient can be written as: dl i (C i γ \i ) dγ j = ψ(k \i ) + ψ(n i + K \i ) + ψ(c ij + γ j ) ψ(γ j ), for all i j First order algorithms: Majorize-Minimize algorithm. Fixed point iteration algorithm (Minka, 2012) γ l+1 j = γ l j i j ψ(c ij + γ l j ) (N 1)ψ(γl j ) i j ψ(n i + K l \i ) ψ(k l \i ). Second order algorithms: Levenberg-Marquardt, EM variant. Julie Josse (Polytechnique) Empirical Bayes PageRank May 29, / 16

14 Ranking statistical journals Julie Josse (Polytechnique) Empirical Bayes PageRank May 29, / 16

15 Statistical journals 47 statistical journals (Varin, 2016), citations published in 2010 related to articles published from 2001 to EBEF α i = 0.95 for CSDA, STMED α i = 0.39 STATAJ (mean 0.77). Teleportation decreasing with the number of references emitted. EIFA 2 scores with self-citations: EBPR (Empirical Bayes Page Rank): Multinomial-Dirichlet with diag Prestige Scimago Journal Rank (PSJR) (Scimago Lab, Scopus) G 2 = α 2 P + (1 α 2 β)1π + β 11 N, π := a i /a +, α 2 = 0.90, β = First eigenvector: G 2 r 2 = r 2. self-citations restricted to 33% Julie Josse (Polytechnique) Empirical Bayes PageRank May 29, / 16

16 Total scores & rankings Journal PSJR EBPR EIFA EBEF EBPR PSJR EIFA EBEF 1 JASA JASA JASA JASA JASA 2 AOS AOS AOS AOS AOS 3 JRSS-B JRSS-B JRSS-B JRSS-B JRSS-B 4 BKA STMED BKA BKA BKA 5 BCS BCS BCS BCS BCS 6 STMED BKA STMED STMED STMED 7 JSPI CSDA CSDA JSPI JSPI 8 CSDA JSPI JSPI CSDA CSDA 9 STSIN STSIN STSIN STSIN STSIN 10 JMA JMA JMA JMA JMA 11 BIOST SPL BIOST BIOST BIOST 12 JCGS BIOST SPL JCGS JCGS 13 SPL JCGS JCGS SPL SPL 14 SJS STSCI STSCI STSCI SJS 15 STSCI SJS SJS SJS STSCI 16 BERN JSS BERN BERN BERN 17 CJS BERN CJS CJS CJS 18 STCMP CSTM STCMP STCMP STCMP 19 BIOJ SMMR TECH BIOJ BIOJ 20 TECH BIOJ BIOJ CSTM TECH 21 CSTM STCMP CSTM TECH CSTM 22 JRSS-C TECH JRSS-A JRSS-C JRSS-C 23 TEST EES AMS JRSS-A TEST 24 JRSS-A CJS JRSS-C TEST JRSS-A 25 AISM JRSS-A AISM AISM AISM 26 AMS JBS TEST AMS AMS 27 JNS AMS LTA LTA JNS 28 LTA JRSS-C JNS JNS LTA 29 JSCS JNS ENVR ENVR JSCS 30 ENVR AISM JSCS SMMR ENVR 31 SMMR TEST JSS JSCS SMMR Julie 32Josse MTKA (Polytechnique) Empirical 5.52 Bayes 5.77PageRank 32 ENVR SMMR May MTKA 29, 2017 MTKA 14 / 16

17 Articles scores & rankings Journal PSJR EBPR EIFA EBEF EBPR PSJR EIFA EBEF 1 JRSS-B JRSS-B JRSS-B JRSS-B JRSS-B 2 STSCI AOS AOS STSCI STSCI 3 JASA STSCI STSCI JASA JASA 4 AOS JASA JASA AOS AOS 5 BKA BKA BKA BKA BKA 6 SJS BCS BCS BCS SJS 7 JCGS JCGS JCGS JCGS JCGS 8 BCS SJS SJS SJS BCS 9 TEST CJS TEST TEST TEST 10 CJS TEST CJS CJS CJS 11 BERN JRSS-A TECH BERN BERN 12 TECH TECH BERN LTA TECH 13 LTA BERN JRSS-A TECH LTA 14 JRSS-A LTA LTA JRSS-A JRSS-A 15 JMA STMED JMA JMA JMA 16 STMOD JMA STMED STMOD STMOD 17 AMS AMS AMS STMED AMS 18 ISR ISR ISR AMS ISR 19 AISM STMOD STMOD ISR AISM 20 STMED CSDA CSDA AISM STMED 21 STCMP AISM AISM STCMP STCMP 22 CSDA STCMP STCMP CSDA CSDA 23 JSPI JSPI JSPI JSPI JSPI 24 ANZS BIOJ BIOJ BIOJ ANZS 25 BIOJ JTSA ANZS BIOST BIOJ 26 JRSS-C BIOST JTSA ANZS JRSS-C 27 STNEE ENVR BIOST JRSS-C STNEE 28 STATS JRSS-C ENVR ENVR STATS 29 BIOST ANZS JRSS-C JTSA BIOST 30 JTSA STATAJ STNEE STATS JTSA 31 ENVR STATS STATS STNEE ENVR Julie 32 Josse CMPST (Polytechnique) Empirical 0.54 Bayes 0.59 PageRank 32 STNEE STSIN May STSIN 29, 2017 CMPST 15 / 16

18 Discussion EBEF for EigenFactor with Dirichlet-multinomial model (cor 0.90) Teleportation varies from one journal to another Simplicity - Quick (6s/ 35s second order algo) Taking into account the zeros: exclusion of a specific field. Julie Josse (Polytechnique) Empirical Bayes PageRank May 29, / 16

19 Discussion EBEF for EigenFactor with Dirichlet-multinomial model (cor 0.90) Teleportation varies from one journal to another Simplicity - Quick (6s/ 35s second order algo) Taking into account the zeros: exclusion of a specific field. Underweight self-citations in a data-driven way? (PSJR 33%). total received (R) / total made (M): S i = c +i c i+ S i (κ) = κc ii + R i κc ii + M i. = c ii +R i c ii +M i. Let κ [0, 1] if S i (0) < 1, S i (κ) increasing function of κ upper bounded by 1. if S i (0) = 1, S i (κ) = 1 for any κ. if S i (0) > 1, S i (κ) decreasing function of κ lower bounded by 1. ( ) Good journals no interest in self-citations: κ i = min min(ri,m i ) c ii, 1 Julie Josse (Polytechnique) Empirical Bayes PageRank May 29, / 16

Empirical Bayes approaches to PageRank type algorithms for rating scientific journals

Empirical Bayes approaches to PageRank type algorithms for rating scientific journals Jean-Louis Foulley, Gilles Celeux, Julie Josse To cite this version: Jean-Louis Foulley, Gilles Celeux, Julie Josse.