Mapping of Science. Bart Thijs ECOOM, K.U.Leuven, Belgium

1 Mapping of Science Bart Thijs ECOOM, K.U.Leuven, Belgium

2 Introduction Definition: Mapping of Science is the application of powerful statistical tools and analytical techniques to uncover the structure or development of science based on the relations between specific entities or units. It can be applied to all units associated with science, such as publications, disciplines, journals, institutions and researchers. The results are usually plotted in a two- or three-dimensional representation (a map).

3 Term networks based on TFiDF scores (Janssens et al. 2008)

4 Introduction Just as traditional cartography tries to model and communicate spatial information, mapping of science is about modeling quantitative relations between entities. In this process three crucial decisions have to be made: Which units or entities are to be plotted? Which quantitative measure will be used to describe the relation among entities? Which analytical tools are appropriate (both for modeling and for representation)?

5 Introduction Börner, Chen and Boyack (2003) described 6 general steps in producing domain maps or visualizations:
1. Data extraction: by searches or by broadening (narrowing)
2. Choice of unit of analysis
3. Measures: counts, frequencies of attributes
4. Similarity: correlation, scalar, vector based
5. Ordination: clustering, dimensionality reduction
6. Display or visualization

6 Introduction Possible applications of Mapping of Science: Information retrieval: e.g. in the nineties ISI developed the SCI-Map software, and Henry Small (1994) applied it to build a map of AIDS research. Science policy tool: e.g. Noyons (2004) presented a case study in which he added citation indicators to a map of subdomains of bibliometrics, scientometrics and domain visualization. Thijs & Glänzel (2008) developed a classification model that groups institutions with the same research profile in order to enhance intercomparability in evaluative studies.

7 Introduction This lecture will mainly focus on the analytical part of the process: selection and calculation of the similarity/distance measures; statistical and analytical tools for classification, clustering or plotting; additional auxiliary tools for dimensionality reduction. Of course, the choice of similarity and ordination depends on the desired application and subsequently on the choice of unit of analysis and relation.

8 Entities or Units of Analysis Publications Journals Authors Institutions Fields Countries

9 Relations between entities Citation based Collaboration or co-occurrence based Content Based

10 Relations: Citation based Direct links: Citations from one publication to another. This relation can also be applied to other entities: one author cites another one. Cross-citations: citations between entities. Citations between two specific publications are very rare; citations between journals, authors or institutions are more common.

11 Relations: Citation based Indirect links: Bibliographic Coupling (Kessler, 1963): Two documents are bibliographically coupled if their respective reference lists share at least one reference. The strength is higher if they share more references; it is lower if the reference lists are longer. Co-Citation: Two documents are co-cited if they both appear in the reference list of one paper. Author Co-citation Analysis (ACA): Two authors are closely related if they are cited in the same work.
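The two indirect link types can be illustrated with a minimal sketch. The paper and reference identifiers below are hypothetical toy data, and only raw counts are computed (no normalisation for the length of the reference lists):

```python
# Hypothetical toy data: each citing paper maps to the set of references it cites.
references = {
    "P1": {"R1", "R2", "R3"},
    "P2": {"R2", "R3", "R4"},
    "P3": {"R5"},
}


def coupling_strength(refs_a, refs_b):
    """Bibliographic coupling: number of references shared by two reference lists."""
    return len(refs_a & refs_b)


def cocitation_counts(reference_lists):
    """Co-citation: for every pair of cited items, count the reference lists containing both."""
    counts = {}
    for refs in reference_lists.values():
        ordered = sorted(refs)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                counts[(first, second)] = counts.get((first, second), 0) + 1
    return counts


bc_p1_p2 = coupling_strength(references["P1"], references["P2"])  # P1 and P2 share R2 and R3
bc_p1_p3 = coupling_strength(references["P1"], references["P3"])  # no shared reference
cocit = cocitation_counts(references)
```

Bibliographic coupling looks at the citing side (which papers share references), while co-citation looks at the cited side (which references appear together).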

12 Bibliographic Coupling Co-citation

13 Author Co-Citation (figure: Author 1 and Author 2)

14 Relations: Collaboration or Co-occurrence Authors: Thijs & Glänzel. Countries: Belgium & Hungary. Institutions: K.U.Leuven & Hungarian Academy of Sciences. Fields: the journal Scientometrics is assigned to both Computer Science and Information Science & Library Science. Co-citation = Co-occurrence

15 Relations: Content based No other relation between entities is necessary; the relation is purely based on the content or topic of the research. Research profiles for authors, institutions and countries. Lexical similarities: Co-word (cf. Callon et al. 1993), keywords (e.g. Noyons, 1999), TF x idf, Latent Semantic Analysis (Indexing) with Singular Value Decomposition

16 Relations: Content based The research profile of an institution is a vector in the field space representing the share of each of the 16 fields in the total set of publications of the specific institution. Two example profiles:

Field  A      B      C      E      G     H     I      M      N     O     P      R      S     U     X     Z
(1)    10.1%  4.5%   58.0%  14.0%  4.0%  3.7%  0.2%   0.6%   0.6%  0.1%  23.4%  5.0%   0.1%  0.0%  0.1%  7.4%
(2)    9.2%   23.1%  13.8%  6.2%   6.2%  1.5%  23.1%  15.4%  1.5%  0.0%  7.7%   12.3%  0.0%  0.0%  0.0%  35.4%
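A research profile of this kind could be computed as in the following sketch. It assumes (this is not stated on the slide) that a publication may be assigned to several fields and that each share is the fraction of the institution's publications assigned to that field, which would explain why the shares above add up to more than 100%. The publication list is hypothetical.

```python
# The 16 field codes as they appear in the profile table.
FIELDS = list("ABCEGHIMNOPRSUXZ")


def research_profile(publications):
    """Return the share of publications assigned to each field, in FIELDS order.

    `publications` is a list of sets of field codes (one set per publication).
    Assumption: multi-assignment is allowed, so shares need not sum to 1.
    """
    if not publications:
        raise ValueError("at least one publication is required")
    total = len(publications)
    return [sum(1 for fields in publications if code in fields) / total
            for code in FIELDS]


# Hypothetical institution with four publications.
pubs = [{"C"}, {"C", "P"}, {"A"}, {"C"}]
profile = research_profile(pubs)
```

The resulting list is a vector in the 16-dimensional field space and can be fed directly into a vector-based similarity or distance measure.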

17 Relations: Content based Lexical similarities: Documents are represented as vectors where each element stands for a specific keyword, term or concept. Co-word analysis leaves the level of individual documents and creates pairs of keywords or terms.

Document-term matrix:
        Term 1  Term 2  Term 3  Term 4
Doc a   n_a1    n_a2    n_a3    n_a4
Doc b   n_b1    n_b2    n_b3    n_b4
Doc c   n_c1    n_c2    n_c3    n_c4

Co-word matrix:
        Term 1  Term 2  Term 3  Term 4
Term 1
Term 2  f_12
Term 3  f_13    f_23
Term 4  f_14    f_24    f_34

n_c1: number of occurrences of term 1 in document c
f_12: number of co-occurrences of term 1 and term 2
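Both matrices can be derived from tokenised documents, as sketched below. The documents and terms are hypothetical, and the sketch assumes that a co-occurrence means "both terms appear in the same document", counted once per document; other counting conventions are possible.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tokenised documents.
docs = {
    "a": ["citation", "network", "citation", "map"],
    "b": ["network", "map"],
    "c": ["citation", "cluster"],
}


def term_counts(documents):
    """Document-term counts: n[doc][term] = occurrences of term in doc."""
    return {doc: Counter(tokens) for doc, tokens in documents.items()}


def coword_counts(documents):
    """Co-word counts: f[(t1, t2)] = number of documents containing both terms."""
    pairs = Counter()
    for tokens in documents.values():
        # Each unordered pair is counted once per document, keyed in sorted order.
        for t1, t2 in combinations(sorted(set(tokens)), 2):
            pairs[(t1, t2)] += 1
    return pairs


n = term_counts(docs)
f = coword_counts(docs)
```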

18 Relations: Content based Lexical similarities: Documents are represented as vectors where each element stands for a specific keyword, term or concept. TF x idf: Terms are not weighted by the number of occurrences but by the term frequency multiplied by the inverse document frequency.

        Term 1  Term 2  Term 3  Term 4  Total
Doc a   n_a1    n_a2    n_a3    n_a4    Σ n_ai
Doc b   n_b1    n_b2    n_b3    n_b4    Σ n_bi
Doc c   n_c1    n_c2    n_c3    n_c4    Σ n_ci
df      df_1    df_2    df_3    df_4

For term 1 in document a: $\mathrm{TF \times idf}_{a1} = \frac{n_{a1}}{\sum_i n_{ai}} \cdot \log\frac{N}{df_1}$, where N is the total number of documents and df_1 is the number of documents containing term 1.
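A minimal sketch of this weighting, using the natural logarithm and hypothetical counts (the slide does not fix the base of the logarithm):

```python
import math


def tf_idf(counts):
    """Weight each count as (n_dt / sum of counts in d) * log(N / df_t).

    `counts` maps document -> {term: number of occurrences}.
    """
    n_docs = len(counts)
    # Document frequency: number of documents in which a term occurs.
    df = {}
    for per_doc in counts.values():
        for term, n in per_doc.items():
            if n > 0:
                df[term] = df.get(term, 0) + 1
    weights = {}
    for doc, per_doc in counts.items():
        total = sum(per_doc.values())
        weights[doc] = {
            term: (n / total) * math.log(n_docs / df[term])
            for term, n in per_doc.items() if n > 0
        }
    return weights


# Hypothetical counts: t1 occurs in every document, t3 and t4 in only one.
counts = {
    "a": {"t1": 2, "t2": 1, "t3": 1},
    "b": {"t1": 1, "t2": 1},
    "c": {"t1": 3, "t4": 1},
}
w = tf_idf(counts)
```

A term that occurs in every document (t1 here) gets weight 0 because log(N/df) = log(1) = 0, so it no longer contributes to the similarity between documents.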

19 Relations: Content based Lexical similarities: Latent Semantic Analysis by application of SVD to convert many terms into a limited set of concepts: $M = U \Sigma V^{*}$

        Term 1     Term 2     Term 3     Term 4
Doc a   TFiDF_a1   TFiDF_a2   TFiDF_a3   TFiDF_a4
Doc b   TFiDF_b1   TFiDF_b2   TFiDF_b3   TFiDF_b4
Doc c   TFiDF_c1   TFiDF_c2   TFiDF_c3   TFiDF_c4

is reduced to

        Concept 1  Concept 2
Doc a   score_a1   score_a2
Doc b   score_b1   score_b2
Doc c   score_c1   score_c2

20 From relations to Similarities (or distances) The relations between units described above are not yet sufficient to be used in mapping. Quantitative implementations are needed. Raw counts: citations. Co-occurrence similarities (Jaccard index). Vector Space Model (Salton's cosine, Euclidean distance). If the similarity is a value between 0 and 1, then distance = 1 - similarity.

21 From relations to Similarities (or distances) Jaccard similarity coefficient: e.g. two authors have 7 joint papers. The first author has 10 papers in total, the second has 35. $J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{7}{10 + 35 - 7} = \frac{7}{38} \approx 0.18$. Salton: The similarity is defined as the cosine of the angle between two vectors representing two documents. $\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$
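Both measures are short enough to sketch directly. The function names are illustrative; the numbers are the two-author example from the slide.

```python
import math


def jaccard_from_counts(joint, total_a, total_b):
    """Jaccard index from the size of the intersection and the two set sizes."""
    return joint / (total_a + total_b - joint)


def cosine(u, v):
    """Salton's cosine: dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Two authors with 7 joint papers, 10 and 35 papers in total.
j = jaccard_from_counts(7, 10, 35)

# Vectors pointing in the same direction have cosine 1, orthogonal ones 0.
same_direction = cosine([1, 2, 0], [2, 4, 0])
orthogonal = cosine([1, 0], [0, 1])
```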

22 From relations to Similarities (or distances) Example of Salton's cosine similarity. n = total number of terms

        Term 1     Term 2     Term 3     Term 4
Doc a   TFiDF_1a   TFiDF_2a   TFiDF_3a   TFiDF_4a
Doc b   TFiDF_1b   TFiDF_2b   TFiDF_3b   TFiDF_4b
Doc c   TFiDF_1c   TFiDF_2c   TFiDF_3c   TFiDF_4c

        Doc a          Doc b          Doc c
Doc a
Doc b   cos(v_a, v_b)
Doc c   cos(v_a, v_c)  cos(v_b, v_c)

23 From relations to Similarities (or distances) Salton's cosine also has a binary implementation, so that it can also be used for similarities based on collaboration or co-occurrence, e.g. two authors have 7 joint papers. The first author has 10 papers in total, the second has 35. $S(A,B) = \frac{7}{\sqrt{10 \cdot 35}} \approx 0.37$. The set of papers by one author can be represented as a binary vector with as many cells as the total number of papers in the set (38 in this example).
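The equivalence between the set-based formula and the cosine of two binary vectors can be checked with a short sketch. The paper identifiers 0 to 37 are hypothetical; only the counts (7 joint, 10 and 35 in total) come from the example.

```python
import math

# Hypothetical paper ids: 7 joint papers, 3 only by author A, 28 only by author B.
joint = set(range(0, 7))
papers_a = joint | set(range(7, 10))    # 10 papers
papers_b = joint | set(range(10, 38))   # 35 papers
all_papers = sorted(papers_a | papers_b)

# Binary vectors with one cell per paper in the combined set.
vec_a = [1 if p in papers_a else 0 for p in all_papers]
vec_b = [1 if p in papers_b else 0 for p in all_papers]


def binary_salton(set_a, set_b):
    """Set-based form: shared items divided by sqrt(|A| * |B|)."""
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))


def cosine(u, v):
    """Cosine of the angle between two vectors (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


s_sets = binary_salton(papers_a, papers_b)
s_vectors = cosine(vec_a, vec_b)
```

For binary vectors the dot product counts the shared papers and each squared norm counts an author's papers, which is why both routes give the same value.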

24 International collaboration in the field of environmental sciences of selected countries. Salton's cosine is used.

25 From relations to Similarities (or distances) Symmetrical vs. asymmetrical relations: Most of the relations discussed above are symmetrical in nature: the direction of the relation has no influence on its strength. This is not an absolute necessity; asymmetrical relations between entities also exist, e.g. authors from a particular country attend conferences in other countries. In this case, the distance of a country to itself does not even need to be zero. Be aware: many analytical tools assume symmetric measures!

26 Scientographical map with unidirectional and mutual affinity (Glänzel et al., 2006)

27 Similarities: Hybrid approach All of the relations mentioned above have drawbacks. Bibliographic coupling is too sparse: only a limited number of pairs have a similarity higher than 0. TFiDF: homonyms with different meanings might inflate the similarity. Possible solution: a hybrid approach in which two different relations are combined.

28 Similarities: Hybrid approach

29 Similarities: Hybrid approach
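The slides do not spell out how the two relations are combined. One simple possibility, shown here purely as an illustrative assumption and not as the method used in the cited work, is a weighted linear combination of a citation-based and a text-based similarity:

```python
def hybrid_similarity(citation_sim, text_sim, weight=0.5):
    """Convex combination of two similarities in [0, 1].

    Assumption for illustration only: `weight` is the share given to the
    citation-based similarity; the remainder goes to the text-based one.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie between 0 and 1")
    return weight * citation_sim + (1.0 - weight) * text_sim


# A pair without shared references still gets a non-zero hybrid similarity
# when the texts overlap, which counters the sparseness of coupling.
sparse_pair = hybrid_similarity(0.0, 0.4)
weighted_pair = hybrid_similarity(0.6, 0.2, weight=0.25)
```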

30 Analytical Techniques Hierarchical clustering is an analytical technique that creates a hierarchy of clusters. Each cluster at a higher level groups documents or clusters that are more distant from each other. The result is represented in a tree-like structure, a dendrogram. Different algorithms for the aggregation or separation of clusters have been developed. The two most commonly used in science mapping are: Ward's algorithm: an approach close to Analysis of Variance that tries to minimize the within-cluster sum of squares. Average linking: the distance between clusters is based on the average distance between all members.

31 Analytical Techniques Hierarchical clustering: One of the main problems with this type of clustering is deciding on the number of clusters. The procedure results in a tree structure; the cut-off has to be decided afterwards, based on: inspection of the dendrogram; qualitative judgment of the cluster result; statistics such as silhouette values.
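The agglomerative procedure with average linking can be sketched in a few lines. This is a naive implementation for illustration on a hypothetical 4 x 4 distance matrix; it does not implement Ward's method, and the returned merge history is what a dendrogram would draw.

```python
def average_linkage(dist):
    """Agglomerative clustering with average linking.

    `dist` is a symmetric distance matrix (list of lists). Returns the merge
    history as tuples (members_1, members_2, distance), lowest level first.
    """
    clusters = [(i,) for i in range(len(dist))]
    merges = []

    def cluster_distance(c1, c2):
        # Average distance between all pairs of members of the two clusters.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Hypothetical distances: items 0 and 1 are close, as are items 2 and 3.
dist = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
merges = average_linkage(dist)
```

Cutting the tree between the second and the third merge yields the two clusters {0, 1} and {2, 3}; the large jump in merge distance at the last step is the kind of signal one looks for when inspecting a dendrogram.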

32 Analytical Techniques Example of hierarchical clustering with Ward's method and Euclidean distance on institutional research profiles. (Figure: Ward clustering of institute research profiles over the fields A to Z; cluster labels BIO, AGR, MDS, GSS, TNS, CHE, GRM, SPM)

33 Analytical Techniques Example of hierarchical clustering with Ward's method and cosine distance (publications in the field Energy & Fuels). The silhouette value (Rousseeuw, 1987) compares the average distance from an object to the other objects in its own cluster with its average distance to the objects of the nearest other cluster. Values close to 1 indicate a good clustering.
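Following Rousseeuw's definition, the silhouette of object i is s(i) = (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. A minimal sketch on a hypothetical distance matrix, assuming at least two clusters:

```python
def silhouette_values(dist, labels):
    """Silhouette value per object from a distance matrix and cluster labels.

    Assumes at least two distinct labels. Objects in singleton clusters get 0,
    which is the usual convention.
    """
    n = len(labels)
    values = []
    for i in range(n):
        same = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not same:
            values.append(0.0)
            continue
        a = sum(same) / len(same)
        # Smallest average distance to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == other) / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Hypothetical distances: items 0 and 1 are close, as are items 2 and 3.
dist = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
good = silhouette_values(dist, [0, 0, 1, 1])  # natural grouping
bad = silhouette_values(dist, [0, 1, 0, 1])   # deliberately wrong grouping
```

Averaging the values over all objects for several candidate cut-offs of the dendrogram gives one statistic for choosing the number of clusters.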

34 Analytical Techniques K-means clustering aims to partition the entities into k clusters. Each entity belongs to the cluster with the nearest mean. The procedure runs iteratively. It starts with k initial means, one per cluster, and assigns each observation to the nearest of them. In a second step the k means are updated. The iteration stops when the total shift of the means becomes marginal. The original algorithm takes only vectors as input, not similarities. This means that this clustering technique can only be applied to a limited set of relations.

35 Analytical Techniques Two features of k-means clustering have a crucial influence on the final solution: 1. The number of clusters has to be chosen prior to any analysis. This choice has to be based on available knowledge of the topic. Sometimes a hierarchical clustering is used to get an indication. 2. Step 1 starts with k different means. These can be chosen randomly but can also be derived from a hierarchical clustering.
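The iteration described above can be sketched as follows. The initial means are passed in explicitly, which makes both influential choices (k and the starting means) visible; the points, the starting means and the tolerance are hypothetical.

```python
def kmeans(points, initial_means, max_iter=100, tol=1e-9):
    """Plain k-means with squared Euclidean distance.

    k is implied by the number of initial means. Returns (labels, means).
    """
    means = [list(m) for m in initial_means]
    labels = []
    for _ in range(max_iter):
        # Step 1: assign every point to the cluster with the nearest mean.
        labels = [
            min(range(len(means)),
                key=lambda k: sum((x - m) ** 2 for x, m in zip(p, means[k])))
            for p in points
        ]
        # Step 2: update each mean; an empty cluster keeps its previous mean.
        new_means = []
        for k, old in enumerate(means):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_means.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_means.append(old)
        # Stop when the total shift of the means becomes marginal.
        shift = sum(
            sum((a - b) ** 2 for a, b in zip(new, old)) ** 0.5
            for new, old in zip(new_means, means)
        )
        means = new_means
        if shift <= tol:
            break
    return labels, means


points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
labels, means = kmeans(points, initial_means=[(0.0, 0.0), (6.0, 6.0)])
```

Because the update step averages coordinates, the input must be vectors (for example research profiles or TFiDF vectors); a bare similarity matrix is not enough.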

36 Analytical Techniques Tools for the representation of similarities: Multidimensional scaling: this methodology plots the matrix of similarities in an N-dimensional space. Several implementations are available, e.g. ALSCAL and PROXSCAL. These implementations are very sensitive to the total number of entities, as the complete similarity matrix has to be processed. Kamada-Kawai is much faster in achieving convergence and can lay out much larger networks. The algorithm is implemented in Pajek, a program for network analysis and visualization.

37 Analytical Techniques Kamada-Kawai representation of 50 documents with hybrid links. This graph was made with Pajek.

38 Kamada-Kawai representation of 7 and 8 clusters in the field Energy & Fuels. The links between clusters are citation links.
