Mapping of Science. Bart Thijs ECOOM, K.U.Leuven, Belgium

1 Mapping of Science Bart Thijs ECOOM, K.U.Leuven, Belgium

2 Introduction
Definition: Mapping of Science is the application of powerful statistical tools and analytical techniques to uncover the structure or development of science, based on the relations between specific entities or units. It can be applied to all units associated with science, such as publications, disciplines, journals, institutions and researchers. In most cases the results are plotted in a two- or three-dimensional representation (a map).

3 Term networks based on TfiDF scores (Janssens et al. 2008)

4 Introduction
Just as traditional cartography models and communicates spatial information, mapping of science models quantitative relations between entities. Three crucial decisions have to be made in this process:
Which units or entities are to be plotted?
Which quantitative measure will be used to describe the relation among entities?
Which analytical tools are appropriate, both for modeling and for representation?

5 Introduction
Börner, Chen and Boyack (2003) described six general steps in producing domain maps or visualizations:
1. Data extraction: by searches or by broadening (narrowing)
2. Choice of unit of analysis
3. Measures: counts, frequencies of attributes
4. Similarity: correlation, scalar, vector based
5. Ordination: clustering, dimensionality reduction
6. Display or visualization

6 Introduction
Possible applications of Mapping of Science:
Information retrieval. E.g. in the nineties ISI developed the SCI-Map software. Henry Small (1994) applied this to build a map of AIDS research.
Science policy tool. E.g. Noyons (2004) presented a case study in which he added citation indicators to a map of subdomains of bibliometrics, scientometrics and domain visualization. Thijs & Glänzel (2008) developed a classification model to group institutions with the same research profile, in order to enhance intercomparability in evaluative studies.

7 Introduction
This lecture will mainly focus on the analytical part of the process:
Selection and calculation of the similarity/distance measures
Statistical and analytical tools for classification, clustering or plotting
Additional auxiliary tools for dimensionality reduction
Of course, the choice of similarity measure and ordination technique depends on the desired application and, subsequently, on the choice of unit of analysis and relation.

8 Entities or Units of Analysis
Publications, Journals, Authors, Institutions, Fields, Countries

9 Relations between entities
Citation based
Collaboration or co-occurrence based
Content based

10 Relations: Citation based
Direct links: citations from one publication to another. This relation can also be applied to other entities, e.g. one author cites another.
Cross-citations: citations between entities. Citations between two publications are very rare; citations between journals, authors and institutions are more common.

11 Relations: Citation based
Indirect links:
Bibliographic Coupling (Kessler, 1963): Two documents are bibliographically coupled if their respective reference lists share at least one reference. The coupling is stronger the more references they share, and weaker the longer the reference lists are.
Co-Citation: Two documents are co-cited if both appear in the reference list of the same paper.
Author Co-citation Analysis (ACA): Two authors are closely related if they are cited in the same work.
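The two indirect links can be sketched in a few lines of Python. This is a minimal illustration on hypothetical toy data (the paper and reference identifiers are invented); the normalization by the geometric mean of the list lengths is one common choice and an assumption here, since the slide only says that longer reference lists lower the strength.

```python
from itertools import combinations
from collections import Counter

# Hypothetical toy data: each citing paper maps to the set of references it cites.
references = {
    "P1": {"R1", "R2", "R3"},
    "P2": {"R2", "R3", "R4"},
    "P3": {"R5"},
}

def coupling_strength(refs_a, refs_b):
    """Raw bibliographic coupling: number of shared references."""
    return len(refs_a & refs_b)

def normalized_coupling(refs_a, refs_b):
    """Shared references divided by the geometric mean of the list lengths
    (an assumed normalization, so that long lists lower the strength)."""
    if not refs_a or not refs_b:
        return 0.0
    return len(refs_a & refs_b) / (len(refs_a) * len(refs_b)) ** 0.5

def cocitation_counts(reference_lists):
    """Count how often each pair of cited documents shares a reference list."""
    counts = Counter()
    for refs in reference_lists.values():
        for pair in combinations(sorted(refs), 2):
            counts[pair] += 1
    return counts

bc = coupling_strength(references["P1"], references["P2"])
nbc = normalized_coupling(references["P1"], references["P2"])
cc = cocitation_counts(references)
```

Here P1 and P2 are coupled through R2 and R3, and R2 and R3 are in turn co-cited by both P1 and P2.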

12 [Diagram panels: Bibliographic Coupling; Co-citation]

13 [Diagram: Author Co-Citation, linking Author 1 and Author 2]

14 Relations: Collaboration or Co-occurrence
Authors: Thijs & Glänzel
Countries: Belgium & Hungary
Institutions: K.U.Leuven & Hungarian Academy of Sciences
Fields: the journal Scientometrics is assigned to Computer Science and to Information Science & Library Science
Co-citation = Co-occurrence

15 Relations: Content based
No other relation between the entities is necessary; the relation is based purely on the content or topic of the research.
Research profiles for authors, institutions and countries
Lexical similarities:
Co-word (cf. Callon et al. 1993)
Keywords (e.g. Noyons, 1999)
TF x idf
Latent Semantic Analysis (Indexing) with Singular Value Decomposition

16 Relations: Content based
The research profile of an institution is a vector in the field space representing the share of each of the 16 fields in the total set of publications of that institution. Two example profiles:

Field   Profile 1   Profile 2
A       10.1%        9.2%
B        4.5%       23.1%
C       58.0%       13.8%
E       14.0%        6.2%
G        4.0%        6.2%
H        3.7%        1.5%
I        0.2%       23.1%
M        0.6%       15.4%
N        0.6%        1.5%
O        0.1%        0.0%
P       23.4%        7.7%
R        5.0%       12.3%
S        0.1%        0.0%
U        0.0%        0.0%
X        0.1%        0.0%
Z        7.4%       35.4%
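As a small sketch of how such a profile could be computed: the function name, the field counts and the publication total below are hypothetical. The sketch assumes that a publication can be assigned to more than one field, so the shares may add up to more than 100%.

```python
def research_profile(field_counts, n_publications):
    """Share of each field in the institution's total set of publications."""
    if n_publications <= 0:
        raise ValueError("n_publications must be positive")
    return {field: count / n_publications for field, count in field_counts.items()}

# Hypothetical institution with 80 publications, some assigned to two fields.
field_counts = {"A": 20, "B": 5, "C": 75}
profile = research_profile(field_counts, n_publications=80)
```

The resulting vector can be compared with other profiles using any of the vector-based measures discussed later, e.g. the cosine or the Euclidean distance.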

17 Relations: Content based
Lexical similarities: Documents are represented as vectors in which each element stands for a specific keyword, term or concept. Co-word analysis leaves the level of individual documents and creates pairs of keywords or terms.

Document-term matrix:
        Term 1   Term 2   Term 3   Term 4
Doc a   n_a1     n_a2     n_a3     n_a4
Doc b   n_b1     n_b2     n_b3     n_b4
Doc c   n_c1     n_c2     n_c3     n_c4

Term co-occurrence matrix:
        Term 1   Term 2   Term 3   Term 4
Term 1
Term 2  f_12
Term 3  f_13     f_23
Term 4  f_14     f_24     f_34

n_c1: number of occurrences of term 1 in document c
f_12: number of co-occurrences of term 1 and term 2
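Both matrices can be derived from the same keyword lists. The sketch below uses hypothetical documents and assumes that a co-occurrence is counted once per document in which both terms appear; other counting rules are possible.

```python
from itertools import combinations
from collections import Counter

# Hypothetical documents reduced to keyword lists.
docs = {
    "a": ["mapping", "citation", "cluster", "citation"],
    "b": ["citation", "cluster"],
    "c": ["mapping", "cluster"],
}

# n[doc][term]: number of occurrences of a term in a document.
n = {doc: Counter(terms) for doc, terms in docs.items()}

# f[(term1, term2)]: number of documents in which both terms occur.
f = Counter()
for terms in docs.values():
    for pair in combinations(sorted(set(terms)), 2):
        f[pair] += 1
```

Only the lower triangle is needed because co-occurrence is symmetric, which is why each pair is stored once in sorted order.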

18 Relations: Content based
Lexical similarities: Documents are represented as vectors in which each element stands for a specific keyword, term or concept. TF x idf: Terms are weighted not by the number of occurrences but by the term frequency multiplied by the inverse document frequency.

        Term 1   Term 2   Term 3   Term 4   Total
Doc a   n_a1     n_a2     n_a3     n_a4     Σ_i n_ai
Doc b   n_b1     n_b2     n_b3     n_b4     Σ_i n_bi
Doc c   n_c1     n_c2     n_c3     n_c4     Σ_i n_ci
N       df_1     df_2     df_3     df_4

$\mathrm{TF \times idf}_{a1} = \dfrac{n_{a1}}{\sum_i n_{ai}} \, \log\dfrac{N}{df_1}$

where N is the number of documents and df_1 is the number of documents containing term 1.
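A minimal sketch of this weighting in plain Python. The term counts are hypothetical, and the natural logarithm is an assumption, since the slide does not fix the base.

```python
import math

# Hypothetical term counts: counts[doc][term]
counts = {
    "a": {"map": 3, "cite": 1, "cluster": 0},
    "b": {"map": 0, "cite": 2, "cluster": 2},
    "c": {"map": 0, "cite": 1, "cluster": 0},
}

def tf_idf(counts):
    """Weight = (term count / total terms in document) * log(N / document frequency)."""
    n_docs = len(counts)
    terms = sorted({t for row in counts.values() for t in row})
    # Document frequency: number of documents in which the term occurs.
    df = {t: sum(1 for row in counts.values() if row.get(t, 0) > 0) for t in terms}
    weights = {}
    for doc, row in counts.items():
        total = sum(row.values())
        weights[doc] = {
            t: ((row.get(t, 0) / total) * math.log(n_docs / df[t])
                if total and df[t] else 0.0)
            for t in terms
        }
    return weights

weights = tf_idf(counts)
```

The term "cite" occurs in every document, so its weight is zero everywhere, while "map" only occurs in document a and receives a high weight there.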

19 Relations: Content based
Lexical similarities: Latent Semantic Analysis applies SVD to convert many terms into a limited set of concepts: $M = U \Sigma V^{*}$

Input (TF x idf weights):
        Term 1     Term 2     Term 3     Term 4
Doc a   TFiDF_a1   TFiDF_a2   TFiDF_a3   TFiDF_a4
Doc b   TFiDF_b1   TFiDF_b2   TFiDF_b3   TFiDF_b4
Doc c   TFiDF_c1   TFiDF_c2   TFiDF_c3   TFiDF_c4

Output (concept scores):
        Concept 1   Concept 2
Doc a   score_a1    score_a2
Doc b   score_b1    score_b2
Doc c   score_c1    score_c2
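The reduction from terms to concepts can be sketched with NumPy's SVD. The matrix values are hypothetical, and taking the document scores as the first k columns of UΣ is one common convention, assumed here.

```python
import numpy as np

# Hypothetical document-term matrix of TF x idf weights (3 documents x 4 terms).
M = np.array([
    [0.8, 0.6, 0.0, 0.0],
    [0.7, 0.5, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.7],
])

def lsa_scores(M, k):
    """Document scores on the k strongest latent concepts (first k columns of U * Sigma)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * s[:k]

scores = lsa_scores(M, k=2)
```

Documents a and b use nearly the same terms and end up close together in the two-concept space, while document c lies far from both.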

20 From relations to Similarities (or distances)
The relations between units described above are not yet sufficient to be used in mapping; quantitative implementations are needed:
Raw counts: citations
Co-occurrence similarities (Jaccard index)
Vector Space Model (Salton's cosine, Euclidean distance)
If the similarity is a value between 0 and 1, then distance = 1 - similarity.

21 From relations to Similarities (or distances)
Jaccard similarity coefficient: e.g. two authors have 7 joint papers. The first author has 10 papers in total, the second has 35.
$J(A,B) = \dfrac{|A \cap B|}{|A \cup B|} = \dfrac{7}{10 + 35 - 7} = \dfrac{7}{38} \approx 0.18$
Salton: the similarity is defined as the cosine of the angle between two vectors representing two documents.
$\cos(\theta) = \dfrac{A \cdot B}{\|A\| \, \|B\|}$

22 From relations to Similarities (or distances)
Example of Salton's cosine similarity (n = total number of terms).

Document vectors:
        Term 1     Term 2     Term 3     Term 4
Doc a   TFiDF_1a   TFiDF_2a   TFiDF_3a   TFiDF_4a
Doc b   TFiDF_1b   TFiDF_2b   TFiDF_3b   TFiDF_4b
Doc c   TFiDF_1c   TFiDF_2c   TFiDF_3c   TFiDF_4c

Similarity matrix:
        Doc a           Doc b           Doc c
Doc a
Doc b   cos(v_a, v_b)
Doc c   cos(v_a, v_c)   cos(v_b, v_c)
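A short sketch of how the similarity matrix follows from the document vectors. The vectors are hypothetical, and the distance is taken as 1 - similarity, which is valid here because the weights are non-negative and the cosine therefore lies between 0 and 1.

```python
import math

def cosine(u, v):
    """Salton's cosine: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical TF x idf vectors for three documents over four terms.
vectors = {
    "a": [1.0, 2.0, 0.0, 0.0],
    "b": [2.0, 4.0, 0.0, 0.0],
    "c": [0.0, 0.0, 3.0, 1.0],
}

names = sorted(vectors)
similarity = {(p, q): cosine(vectors[p], vectors[q]) for p in names for q in names}
distance = {pair: 1.0 - s for pair, s in similarity.items()}
```

Documents a and b point in the same direction and get similarity 1 regardless of their length; a and c share no terms and get similarity 0.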

23 From relations to Similarities (or distances)
Salton's cosine also has a binary implementation, so that it can also be used for similarities based on collaboration or co-occurrence. E.g. two authors have 7 joint papers. The first author has 10 papers in total, the second has 35.
$S(A,B) = \dfrac{7}{\sqrt{10 \cdot 35}} \approx 0.37$
The set of papers by each author can be represented as a binary vector with as many cells as the total number of papers in the combined set (38 in this example).
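The two co-authorship measures can be checked numerically with the figures from the example (7 joint papers, 10 and 35 papers in total). The function names are illustrative; the binary vectors are constructed so that exactly 7 of the 38 positions overlap.

```python
import math

def jaccard(joint, total_a, total_b):
    """Joint items divided by the size of the union."""
    union = total_a + total_b - joint
    return joint / union if union else 0.0

def salton(joint, total_a, total_b):
    """Binary Salton's cosine: joint items divided by the geometric mean of the totals."""
    denom = math.sqrt(total_a * total_b)
    return joint / denom if denom else 0.0

j = jaccard(7, 10, 35)   # 7 / 38
s = salton(7, 10, 35)    # 7 / sqrt(350)

# Same result from binary vectors over the 38 papers in the combined set.
a = [1] * 10 + [0] * 28          # author A wrote papers 0..9
b = [0] * 3 + [1] * 35           # author B wrote papers 3..37
dot = sum(x * y for x, y in zip(a, b))
binary_cosine = dot / (math.sqrt(sum(a)) * math.sqrt(sum(b)))
```

For the same pair of authors Salton's measure is higher than the Jaccard coefficient, because it divides by the geometric mean of the two totals rather than by the size of the union.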

24 International collaboration in the field of environmental sciences of selected countries. Salton's cosine is used.

25 From relations to Similarities (or distances)
Symmetrical vs. asymmetrical relations
Most of the relations discussed above are symmetrical: the direction of the relation has no influence on its strength. This is not an absolute necessity; asymmetrical relations between entities also exist. E.g. authors from a particular country attend conferences in other countries. In this case, the distance from a country to itself does not even need to be zero.
Be aware: many analytical tools assume symmetric measures!

26 Scientographical map with unidirectional and mutual affinity (Glänzel et al., 2006)

27 Similarities: Hybrid approach
All of the above-mentioned relations have drawbacks.
Bibliographic coupling is too sparse: only a limited number of pairs have a similarity higher than 0.
TFiDF: homonyms with different meanings might increase similarity.
Possible solution: a hybrid approach in which two different relations are combined.
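The transcription does not state how the two relations are combined. Purely as an illustration, one simple possibility is a weighted linear combination of a citation-based and a text-based similarity; the function name and the default weight below are assumptions, not the method used in the lecture.

```python
def hybrid_similarity(sim_citation, sim_text, weight=0.5):
    """Weighted linear combination of two similarities that both lie in [0, 1].

    `weight` is the share given to the citation-based similarity
    (an assumed, illustrative combination rule).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie between 0 and 1")
    return weight * sim_citation + (1.0 - weight) * sim_text

# A pair with no shared references still gets a non-zero hybrid similarity
# when the texts are related.
combined = hybrid_similarity(0.0, 0.4)
```

With this rule the sparse coupling matrix is filled in by the textual component, while pairs that are similar only through a homonym are damped by the missing citation link.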

28 Similarities: Hybrid approach

29 Similarities: Hybrid approach

30 Analytical Techniques
Hierarchical clustering is an analytical technique that creates a hierarchy of clusters. Each cluster at a higher level groups documents or clusters that are more distant from each other. The result is represented in a tree-like structure, a dendrogram. Different algorithms for the aggregation or separation of clusters have been developed. The two most commonly used in science mapping are:
Ward algorithm: an approach close to analysis of variance. It tries to minimize the sum of squares within clusters.
Average linking: the distance between clusters is based on the average distance between all members.

31 Analytical Techniques
Hierarchical clustering: One of the main problems with this type of clustering is deciding on the number of clusters. The procedure results in a tree structure; the cut-off has to be decided afterwards, based on:
Inspection of the dendrogram
Qualitative judgment of the cluster result
Statistics: silhouette values
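A minimal sketch of Ward and average linking followed by a cut-off, assuming SciPy is available. The four research profiles are hypothetical, and cutting into exactly two clusters is an arbitrary choice made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical research profiles (rows) over three fields (columns).
profiles = np.array([
    [0.80, 0.15, 0.05],
    [0.75, 0.20, 0.05],
    [0.10, 0.10, 0.80],
    [0.05, 0.15, 0.80],
])

# Build the hierarchy; each row of the result describes one merge in the dendrogram.
ward_tree = linkage(profiles, method="ward")  # Ward uses Euclidean distance
average_tree = linkage(profiles, method="average", metric="euclidean")

# The cut-off is decided afterwards: here the tree is cut into two clusters.
ward_labels = fcluster(ward_tree, t=2, criterion="maxclust")
average_labels = fcluster(average_tree, t=2, criterion="maxclust")
```

In practice the value passed as `t` would be chosen after inspecting the dendrogram or comparing silhouette values for several cut-offs.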

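32 Analytical Techniques
Example of hierarchical clustering with Ward's method and Euclidean distance, applied to institutional research profiles over the 16 fields (A, B, C, E, G, H, I, M, N, O, P, R, S, U, X, Z).
[Figure: Ward clustering of institute research profiles, with cluster labels BIO, AGR, MDS, GSS, TNS, CHE, GRM, SPM]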

33 Analytical Techniques
Example of hierarchical clustering with Ward's method and cosine distance (publications in the field Energy & Fuels).
The silhouette value (Rousseeuw, 1987) compares the distance from one object to the other objects in the same cluster with its distance to the objects in the nearest other cluster. Values close to 1 indicate a good clustering.
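The silhouette value can be sketched in plain Python. The sketch follows Rousseeuw's definition s = (b - a) / max(a, b), where a is the mean distance to the other members of the object's own cluster and b is the mean distance to the members of the nearest other cluster. The points and labels are hypothetical, and Euclidean distance is an assumption.

```python
import math

def silhouette_values(points, labels):
    """Silhouette value for every point, using Euclidean distance."""
    def dist(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

    values = []
    for i, p in enumerate(points):
        # Collect distances from point i to all other points, grouped by cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            values.append(0.0)  # singleton cluster, or only one cluster in total
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_values(points, [0, 0, 1, 1])
bad = silhouette_values(points, [0, 1, 0, 1])
```

Averaging the values over all objects gives one number per clustering solution, which can be compared across different cut-offs of the dendrogram.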

34 Analytical Techniques
K-means clustering aims to partition the entities into k clusters, where each entity belongs to the cluster with the nearest mean. The procedure runs iteratively. It starts with k initial means, one per cluster, and assigns each observation to the nearest of them. In a second step the k means are updated. The iteration stops when the total shift of the means becomes marginal. The original algorithm takes only vectors as input, not similarities. This means that this clustering technique can be applied only to a limited set of relations.

35 Analytical Techniques
Two features of k-means clustering have a crucial influence on the final solution:
1. The number of clusters has to be chosen prior to any analysis. The choice has to be based on available knowledge of the topic. Sometimes a hierarchical clustering is used to get an indication.
2. Step 1 starts with k different means. These can be chosen randomly but can also be derived from a hierarchical clustering.
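The iteration can be sketched in plain Python (3.8 or later, for `math.dist`). The initial means are passed in explicitly, which makes both decisions visible: the number of clusters is the number of initial means, and the starting points could come from a random draw or from a prior hierarchical clustering. The data, tolerance and iteration limit are hypothetical.

```python
import math

def k_means(points, initial_means, tol=1e-9, max_iter=100):
    """Plain k-means on vectors; k is the number of initial means."""
    means = [tuple(m) for m in initial_means]
    labels = []
    for _ in range(max_iter):
        # Step 1: assign each entity to the cluster with the nearest mean.
        labels = [min(range(len(means)), key=lambda c: math.dist(p, means[c]))
                  for p in points]
        # Step 2: update the means; an empty cluster keeps its old mean.
        new_means = []
        for c, old in enumerate(means):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_means.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_means.append(old)
        # Stop when the total shift of the means becomes marginal.
        shift = sum(math.dist(m, n) for m, n in zip(means, new_means))
        means = new_means
        if shift < tol:
            break
    return labels, means

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels, means = k_means(points, initial_means=[(0.0, 0.0), (10.0, 0.0)])
```

Because the function needs coordinates to compute means, it cannot be run on a similarity matrix alone, which is the limitation noted above.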

36 Analytical Techniques
Tools for the representation of similarities
Multidimensional scaling: this methodology plots the matrix of similarities in an N-dimensional space. Several implementations of the methodology are available:
ALSCAL
PROXSCAL
These implementations are very sensitive to the total number of entities, as the complete similarity matrix has to be processed.
Kamada-Kawai is much faster in achieving convergence and can lay out much larger networks. The algorithm is implemented in Pajek, a program for network analysis and visualization.
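To illustrate the idea of multidimensional scaling, the sketch below implements classical (Torgerson) scaling with NumPy. This is a simpler technique than ALSCAL or PROXSCAL and is used here only as a stand-in. It expects distances, so a similarity matrix would first be converted, e.g. with distance = 1 - similarity. The example points are hypothetical.

```python
import numpy as np

def classical_mds(D, dims=2):
    """Embed a symmetric distance matrix D in `dims` dimensions (Torgerson scaling)."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dims]   # indices of the largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Hypothetical entities with known positions, used only to build a distance matrix.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = classical_mds(D, dims=2)
D_embedded = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

The recovered coordinates reproduce the original distances up to rotation and reflection. The full n x n matrix has to be held in memory and decomposed, which is why the number of entities is a practical limit.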

37 Analytical Techniques
Kamada-Kawai representation of 50 documents with hybrid links. This graph was made with Pajek.

38 Kamada-Kawai representation of 7 and 8 clusters in the field Energy & Fuels. The links between clusters are citation links.


More information

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276,

More information

Term Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

Term Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze Term Weighting and Vector Space Model Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 Ranked retrieval Thus far, our queries have all been Boolean. Documents either

More information

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang Matrix Factorization & Latent Semantic Analysis Review Yize Li, Lanbo Zhang Overview SVD in Latent Semantic Indexing Non-negative Matrix Factorization Probabilistic Latent Semantic Indexing Vector Space

More information

Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson

Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Collection Frequency, cf Define: The total

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval Lecture 11: Probabilistic Information Retrieval 1 Outline Basic Probability Theory Probability Ranking Principle Extensions 2 Basic Probability Theory For events A

More information

Information Retrieval. Lecture 6

Information Retrieval. Lecture 6 Information Retrieval Lecture 6 Recap of the last lecture Parametric and field searches Zones in documents Scoring documents: zone weighting Index support for scoring tf idf and vector spaces This lecture

More information

Generalized Blockmodeling with Pajek

Generalized Blockmodeling with Pajek Metodološki zvezki, Vol. 1, No. 2, 2004, 455-467 Generalized Blockmodeling with Pajek Vladimir Batagelj 1, Andrej Mrvar 2, Anuška Ferligoj 3, and Patrick Doreian 4 Abstract One goal of blockmodeling is

More information

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2018. Instructor: Walid Magdy

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2018. Instructor: Walid Magdy Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-2018 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 2 1 Boolean Retrieval Thus

More information

Visualizing the Evolution of a Subject Domain: A Case Study

Visualizing the Evolution of a Subject Domain: A Case Study Visualizing the Evolution of a Subject Domain: A Case Study Chaomei Chen Brunel University Leslie Carr Southampton University Abstract We explore the potential of information visualization techniques in

More information

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data

More information

Clustering with k-means and Gaussian mixture distributions

Clustering with k-means and Gaussian mixture distributions Clustering with k-means and Gaussian mixture distributions Machine Learning and Object Recognition 2017-2018 Jakob Verbeek Clustering Finding a group structure in the data Data in one cluster similar to

More information

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012 Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Principal Components Analysis Le Song Lecture 22, Nov 13, 2012 Based on slides from Eric Xing, CMU Reading: Chap 12.1, CB book 1 2 Factor or Component

More information

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276,

More information

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan Clustering CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Supervised vs Unsupervised Learning Supervised learning Given x ", y " "%& ', learn a function f: X Y Categorical output classification

More information

Creation of a highly detailed, dynamic, global model and map of science

Creation of a highly detailed, dynamic, global model and map of science Creation of a highly detailed, dynamic, global model and map of science Kevin W. Boyack * and Richard Klavans ** * kboyack@mapofscience.com SciTech Strategies, Inc., Albuquerque, NM 87122 (USA) ** rklavans@mapofscience.com

More information

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2017. Instructor: Walid Magdy

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2017. Instructor: Walid Magdy Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-017 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 1 Boolean Retrieval Thus far,

More information

Comparison between ontology distances (preliminary results)

Comparison between ontology distances (preliminary results) Comparison between ontology distances (preliminary results) Jérôme David, Jérôme Euzenat ISWC 2008 1 / 17 Context A distributed and heterogenous environment: the semantic Web Several ontologies on the

More information

Machine Learning for natural language processing

Machine Learning for natural language processing Machine Learning for natural language processing Classification: k nearest neighbors Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Summer 2016 1 / 28 Introduction Classification = supervised method

More information

Large-scale Image Annotation by Efficient and Robust Kernel Metric Learning

Large-scale Image Annotation by Efficient and Robust Kernel Metric Learning Large-scale Image Annotation by Efficient and Robust Kernel Metric Learning Supplementary Material Zheyun Feng Rong Jin Anil Jain Department of Computer Science and Engineering, Michigan State University,

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1 Week 2 Based in part on slides from textbook, slides of Susan Holmes October 3, 2012 1 / 1 Part I Other datatypes, preprocessing 2 / 1 Other datatypes Document data You might start with a collection of

More information

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes Week 2 Based in part on slides from textbook, slides of Susan Holmes Part I Other datatypes, preprocessing October 3, 2012 1 / 1 2 / 1 Other datatypes Other datatypes Document data You might start with

More information

Similarity and Dissimilarity

Similarity and Dissimilarity 1//015 Similarity and Dissimilarity COMP 465 Data Mining Similarity of Data Data Preprocessing Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed.

More information

Using global mapping to create more accurate document-level maps of research fields

Using global mapping to create more accurate document-level maps of research fields Using global mapping to create more accurate document-level maps of research fields Richard Klavans a & Kevin W. Boyack b a SciTech Strategies, Inc., Berwyn, PA 19312 USA (rklavans@mapofscience.com) b

More information

Fall CS646: Information Retrieval. Lecture 6 Boolean Search and Vector Space Model. Jiepu Jiang University of Massachusetts Amherst 2016/09/26

Fall CS646: Information Retrieval. Lecture 6 Boolean Search and Vector Space Model. Jiepu Jiang University of Massachusetts Amherst 2016/09/26 Fall 2016 CS646: Information Retrieval Lecture 6 Boolean Search and Vector Space Model Jiepu Jiang University of Massachusetts Amherst 2016/09/26 Outline Today Boolean Retrieval Vector Space Model Latent

More information

Embeddings Learned By Matrix Factorization

Embeddings Learned By Matrix Factorization Embeddings Learned By Matrix Factorization Benjamin Roth; Folien von Hinrich Schütze Center for Information and Language Processing, LMU Munich Overview WordSpace limitations LinAlgebra review Input matrix

More information

Semantic Similarity and Relatedness

Semantic Similarity and Relatedness Semantic Relatedness Semantic Similarity and Relatedness (Based on Budanitsky, Hirst 2006 and Chapter 20 of Jurafsky/Martin 2 nd. Ed. - Most figures taken from either source.) Many applications require

More information

CS 3750 Advanced Machine Learning. Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya

CS 3750 Advanced Machine Learning. Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya CS 375 Advanced Machine Learning Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya Outline SVD and LSI Kleinberg s Algorithm PageRank Algorithm Vector Space Model Vector space model represents

More information

Preprocessing & dimensionality reduction

Preprocessing & dimensionality reduction Introduction to Data Mining Preprocessing & dimensionality reduction CPSC/AMTH 445a/545a Guy Wolf guy.wolf@yale.edu Yale University Fall 2016 CPSC 445 (Guy Wolf) Dimensionality reduction Yale - Fall 2016

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 12 Jan-Willem van de Meent (credit: Yijun Zhao, Percy Liang) DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Linear Dimensionality

More information

How rural the EU RDP is? An analysis through spatial funds allocation

How rural the EU RDP is? An analysis through spatial funds allocation How rural the EU RDP is? An analysis through spatial funds allocation Beatrice Camaioni, Roberto Esposti, Antonello Lobianco, Francesco Pagliacci, Franco Sotte Department of Economics and Social Sciences

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 9: Dimension Reduction/Word2vec Cho-Jui Hsieh UC Davis May 15, 2018 Principal Component Analysis Principal Component Analysis (PCA) Data

More information

Applied Natural Language Processing

Applied Natural Language Processing Applied Natural Language Processing Info 256 Lecture 9: Lexical semantics (Feb 19, 2019) David Bamman, UC Berkeley Lexical semantics You shall know a word by the company it keeps [Firth 1957] Harris 1954

More information

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Information Search and Management Prof. Chris Clifton 6 September 2017 Material adapted from course created by Dr. Luo Si, now leading Alibaba research group 1 Vector Space Model Disadvantages:

More information

CS246 Final Exam, Winter 2011

CS246 Final Exam, Winter 2011 CS246 Final Exam, Winter 2011 1. Your name and student ID. Name:... Student ID:... 2. I agree to comply with Stanford Honor Code. Signature:... 3. There should be 17 numbered pages in this exam (including

More information

Multivariate Statistics 101. Ordination (PCA, NMDS, CA) Cluster Analysis (UPGMA, Ward s) Canonical Correspondence Analysis

Multivariate Statistics 101. Ordination (PCA, NMDS, CA) Cluster Analysis (UPGMA, Ward s) Canonical Correspondence Analysis Multivariate Statistics 101 Ordination (PCA, NMDS, CA) Cluster Analysis (UPGMA, Ward s) Canonical Correspondence Analysis Multivariate Statistics 101 Copy of slides and exercises PAST software download

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search IR models: Vector Space Model IR Models Set Theoretic Classic Models Fuzzy Extended Boolean U s e r T a s k Retrieval: Adhoc Filtering Brosing boolean vector probabilistic

More information

Intelligent Data Analysis Lecture Notes on Document Mining

Intelligent Data Analysis Lecture Notes on Document Mining Intelligent Data Analysis Lecture Notes on Document Mining Peter Tiňo Representing Textual Documents as Vectors Our next topic will take us to seemingly very different data spaces - those of textual documents.

More information