Mapping of Science. Bart Thijs ECOOM, K.U.Leuven, Belgium
2 Introduction Definition: Mapping of science is the application of powerful statistical tools and analytical techniques to uncover the structure or development of science based on the relations between specific entities or units. It can be applied to all units associated with science, such as publications, disciplines, journals, institutions and researchers. Most often, the results are plotted in a two- or three-dimensional representation (a map).
3 Term networks based on TF-IDF scores (Janssens et al., 2008)
4 Introduction Just as traditional cartography tries to model and communicate spatial information, mapping of science is about modeling quantitative relations between entities. In this process, three crucial decisions have to be made: Which units or entities are to be plotted? Which quantitative measure will be used to describe the relation among entities? Which analytical tools are appropriate (both for modeling and for representation)?
5 Introduction Börner, Chen and Boyack (2003) described six general steps in producing domain maps or visualizations:
1. Data extraction: by searches or by broadening (narrowing)
2. Choice of unit of analysis
3. Measures: counts, frequencies of attributes
4. Similarity: correlation, scalar, vector based
5. Ordination: clustering, dimensionality reduction
6. Display or visualization
6 Introduction Possible applications of mapping of science:
- Information retrieval: e.g., in the nineties ISI developed the SCI-Map software. Henry Small (1994) applied it to build a map of AIDS research.
- Science policy tool: e.g., Noyons (2004) presented a case study in which citation indicators were added to a map of subdomains of bibliometrics, scientometrics and domain visualization. Thijs & Glänzel (2008) developed a classification model to group institutions with the same research profile, enhancing intercomparability in evaluative studies.
7 Introduction This lecture will mainly focus on the analytical part of the process: selection and calculation of the similarity or distance measures; statistical and analytical tools for classification, clustering or plotting; auxiliary tools for dimensionality reduction. Of course, the choice of similarity and ordination depends on the desired application and, subsequently, on the choice of unit of analysis and relation.
8 Entities or Units of Analysis: publications, journals, authors, institutions, fields, countries
9 Relations between entities: citation based, collaboration or co-occurrence based, content based
10 Relations: Citation based. Direct links: citations from one publication to another. This relation can also be applied to other entities: one author cites another. Cross-citations: citations aggregated between entities. Between two individual publications such mutual links are very rare; between journals, authors or institutions they are more common.
11 Relations: Citation based. Indirect links: Bibliographic coupling (Kessler, 1963): two documents are bibliographically coupled if their reference lists share at least one reference. The strength is higher if they share more references and lower if the reference lists are longer. Co-citation: two documents are co-cited if they both appear in the reference list of one paper. Author co-citation analysis (ACA): two authors are closely related if they are cited in the same work.
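Both indirect links above can be counted directly from reference lists. A minimal sketch (the paper and reference IDs are hypothetical examples):

```python
# Bibliographic coupling and co-citation counts from reference lists.
# Paper IDs ("p1", ...) and reference IDs ("r1", ...) are hypothetical.
refs = {
    "p1": {"r1", "r2", "r3"},
    "p2": {"r2", "r3", "r4"},
    "p3": {"r5"},
}

def coupling_strength(a, b):
    """Kessler's bibliographic coupling: number of shared references."""
    return len(refs[a] & refs[b])

def cocitations(cited_a, cited_b):
    """Number of papers whose reference list contains both cited items."""
    return sum(1 for r in refs.values() if cited_a in r and cited_b in r)

print(coupling_strength("p1", "p2"))  # p1 and p2 share r2 and r3 -> 2
print(cocitations("r2", "r3"))        # r2 and r3 appear together in p1 and p2 -> 2
```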
12 Diagram: bibliographic coupling versus co-citation.
13 Diagram: author co-citation between Author 1 and Author 2.
14 Relations: Collaboration or Co-occurrence. Authors: Thijs & Glänzel. Countries: Belgium & Hungary. Institutions: K.U.Leuven & Hungarian Academy of Sciences. Fields: the journal Scientometrics is assigned to both Computer Science and Information Science & Library Science. Co-citation = co-occurrence.
15 Relations: Content based. No other relation between entities is necessary; the relation is purely based on the content or topic of the research. Research profiles for authors, institutions and countries. Lexical similarities: co-word analysis (cf. Callon et al., 1993); keywords (e.g., Noyons, 1999); TF-IDF; Latent Semantic Analysis (Indexing) with Singular Value Decomposition.
16 Relations: Content based. The research profile of an institution is a vector in the field space representing the share of each of the 16 fields in the total set of publications of that institution.

Field:      A      B      C      E      G     H     I      M      N     O     P      R      S     U     X     Z
Inst. 1:  10.1%   4.5%  58.0%  14.0%   4.0%  3.7%  0.2%   0.6%   0.6%  0.1%  23.4%   5.0%  0.1%  0.0%  0.1%   7.4%
Inst. 2:   9.2%  23.1%  13.8%   6.2%   6.2%  1.5%  23.1%  15.4%  1.5%  0.0%   7.7%  12.3%  0.0%  0.0%  0.0%  35.4%

(Shares may sum to more than 100% when publications are assigned to multiple fields.)
17 Relations: Content based. Lexical similarities: documents are represented as vectors in which each element stands for a specific keyword, term or concept. Co-word analysis leaves the level of individual documents and creates pairs of keywords or terms.

Term-document matrix (n_c1 = number of occurrences of term 1 in document c):

        Term 1  Term 2  Term 3  Term 4
Doc a   n_a1    n_a2    n_a3    n_a4
Doc b   n_b1    n_b2    n_b3    n_b4
Doc c   n_c1    n_c2    n_c3    n_c4

Co-word matrix (f_12 = number of co-occurrences of term 1 and term 2):

        Term 1  Term 2  Term 3
Term 2  f_12
Term 3  f_13    f_23
Term 4  f_14    f_24    f_34
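The step from the term-document matrix to the co-word matrix can be sketched in NumPy, counting for each pair of terms the number of documents in which both occur (the count values are illustrative):

```python
import numpy as np

# Term-document count matrix: rows = documents, columns = terms
# (values are illustrative)
N = np.array([
    [2, 1, 0, 3],   # doc a
    [0, 2, 1, 0],   # doc b
    [1, 1, 1, 1],   # doc c
])

# Co-word matrix: F[i, j] = number of documents in which
# term i and term j occur together
B = (N > 0).astype(int)   # binary occurrence matrix
F = B.T @ B               # document-level co-occurrence counts
np.fill_diagonal(F, 0)    # keep only pairs of distinct terms
print(F)
```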
18 Relations: Content based. Lexical similarities: documents are represented as vectors in which each element stands for a specific keyword, term or concept. TF-IDF: terms are weighted not by their raw number of occurrences but by the term frequency multiplied by the inverse document frequency:

TFiDF_a1 = (n_a1 / Σ_i n_ai) × log(N / df_1)

where N is the total number of documents and df_j is the number of documents containing term j.

        Term 1  Term 2  Term 3  Term 4  Total
Doc a   n_a1    n_a2    n_a3    n_a4    Σ_i n_ai
Doc b   n_b1    n_b2    n_b3    n_b4    Σ_i n_bi
Doc c   n_c1    n_c2    n_c3    n_c4    Σ_i n_ci
df      df_1    df_2    df_3    df_4    N
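The weighting formula above translates directly into a few NumPy lines (the count matrix is illustrative):

```python
import numpy as np

# Document-term count matrix (rows = docs, columns = terms); illustrative values
N = np.array([
    [2, 1, 0, 3],
    [0, 2, 1, 0],
    [1, 1, 1, 1],
], dtype=float)

n_docs = N.shape[0]
tf = N / N.sum(axis=1, keepdims=True)   # term frequency: n_aj / sum_i n_ai
df = (N > 0).sum(axis=0)                # document frequency per term
idf = np.log(n_docs / df)               # inverse document frequency
tfidf = tf * idf                        # TF-IDF weight matrix
```

Note that a term occurring in every document (here term 2) gets idf = log(1) = 0 and thus carries no weight.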
19 Relations: Content based. Lexical similarities: Latent Semantic Analysis applies the Singular Value Decomposition, M = UΣV*, to convert many terms into a limited set of concepts. The term-document matrix of TF-IDF weights is mapped onto a document-concept matrix:

        Term 1    Term 2    Term 3    Term 4
Doc a   TFiDF_a1  TFiDF_a2  TFiDF_a3  TFiDF_a4
Doc b   TFiDF_b1  TFiDF_b2  TFiDF_b3  TFiDF_b4
Doc c   TFiDF_c1  TFiDF_c2  TFiDF_c3  TFiDF_c4

becomes

        Concept 1  Concept 2
Doc a   score_a1   score_a2
Doc b   score_b1   score_b2
Doc c   score_c1   score_c2
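The SVD step can be sketched with NumPy: decompose M, keep the k largest singular values, and use the scaled left singular vectors as concept scores (the matrix values are illustrative):

```python
import numpy as np

# Truncated SVD on a (documents x terms) TF-IDF matrix: M = U S V*
# keeping k = 2 concepts; the matrix values are illustrative
M = np.array([
    [0.9, 0.8, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.8],
])

k = 2
U, s, Vt = np.linalg.svd(M, full_matrices=False)
doc_scores = U[:, :k] * s[:k]     # documents in concept space
M_k = doc_scores @ Vt[:k, :]      # rank-k reconstruction approximates M
```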
20 From relations to similarities (or distances). The relations between units described above are not yet sufficient to be used in mapping; quantitative implementations are needed: raw counts (citations), co-occurrence similarities (Jaccard index), the vector space model (Salton's cosine, Euclidean distance). If similarity is a value between 0 and 1, then distance = 1 − similarity.
21 From relations to similarities (or distances). Jaccard similarity coefficient, e.g.: two authors have 7 joint papers; the first author has 10 papers in total, the second has 35. J(A,B) = 7 / (10 + 35 − 7) = 7/38 ≈ 0.18. Salton: the similarity is defined as the cosine of the angle between two vectors representing two documents: cos(θ) = A·B / (‖A‖ ‖B‖).
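Both measures fit in a few lines of plain Python, using the slide's example numbers:

```python
import math

def jaccard(n_joint, n_a, n_b):
    """Jaccard index from joint and total counts: |A∩B| / |A∪B|."""
    return n_joint / (n_a + n_b - n_joint)

def cosine(a, b):
    """Salton's cosine between two vectors: A·B / (|A| |B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

# Slide example: 7 joint papers, author totals of 10 and 35
print(round(jaccard(7, 10, 35), 3))   # 7 / 38 ≈ 0.184
```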
22 From relations to similarities (or distances). Example of Salton's cosine similarity; v_a denotes the TF-IDF vector of document a (n = total number of terms).

        Term 1    Term 2    Term 3    Term 4
Doc a   TFiDF_1a  TFiDF_2a  TFiDF_3a  TFiDF_4a
Doc b   TFiDF_1b  TFiDF_2b  TFiDF_3b  TFiDF_4b
Doc c   TFiDF_1c  TFiDF_2c  TFiDF_3c  TFiDF_4c

        Doc a         Doc b
Doc b   cos(v_a,v_b)
Doc c   cos(v_a,v_c)  cos(v_b,v_c)
23 From relations to similarities (or distances). Salton's cosine also has a binary implementation, so that it can be used for similarities based on collaboration or co-occurrence. E.g.: two authors have 7 joint papers; the first author has 10 papers in total, the second has 35. S(A,B) = 7 / sqrt(10 × 35) = 7/√350 ≈ 0.37. The set of papers by one author can be represented as a binary vector with as many cells as the total number of papers in the set (38 in this example).
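For binary vectors the cosine reduces to the number of shared items over the geometric mean of the two set sizes, which is exactly the slide's formula:

```python
import math

def binary_cosine(n_joint, n_a, n_b):
    """Salton's cosine for binary vectors: shared items over the
    geometric mean of the two set sizes."""
    return n_joint / math.sqrt(n_a * n_b)

# Slide example: 7 joint papers, author totals of 10 and 35
print(round(binary_cosine(7, 10, 35), 3))  # 7 / sqrt(350) ≈ 0.374
```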
24 International collaboration in the field of environmental sciences of selected countries. Salton's cosine is used.
25 From relations to similarities (or distances). Symmetrical vs. asymmetrical relations. Most of the relations discussed above are symmetrical in nature: the direction of the relation has no influence on the strength. This is not an absolute necessity; asymmetrical relations between entities also exist. E.g., authors from a particular country attend conferences in other countries. In such a case, even the distance from a country to itself need not be zero. Be aware that many analytical tools assume symmetric measures!
26 Scientographical map with unidirectional and mutual affinity (Glänzel et al., 2006)
27 Similarities: Hybrid approach. All of the relations mentioned above have drawbacks. Bibliographic coupling is too sparse: only a limited number of pairs have a similarity higher than 0. TF-IDF: homonyms with different meanings might inflate similarity. A possible solution is a hybrid approach in which two different relations are combined.
28 Similarities: Hybrid approach
29 Similarities: Hybrid approach
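The slides do not spell out the combination rule here. One simple sketch, as an assumption and not necessarily the method shown on these slides, is a convex combination of a lexical and a bibliographic-coupling similarity matrix:

```python
import numpy as np

def hybrid_similarity(sim_text, sim_citation, weight=0.5):
    """Convex combination of a lexical similarity matrix and a
    bibliographic-coupling similarity matrix. The linear weighting is an
    assumption; published hybrid approaches also use other combination rules."""
    return weight * sim_text + (1 - weight) * sim_citation

# Illustrative 2x2 similarity matrices for a pair of documents
sim_text = np.array([[1.0, 0.2], [0.2, 1.0]])
sim_bc   = np.array([[1.0, 0.6], [0.6, 1.0]])
print(hybrid_similarity(sim_text, sim_bc, weight=0.25))
```

Because both inputs lie in [0, 1], the result stays in [0, 1] and can still be converted to a distance as 1 − similarity.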
30 Analytical Techniques. Hierarchical clustering is an analytical technique that creates a hierarchy of clusters. Each cluster at a higher level groups documents or clusters that are more distant from each other. The result is represented in a tree-like structure, a dendrogram. Different algorithms for the aggregation or separation of clusters have been developed. The two most commonly used in science mapping are: the Ward algorithm, an approach close to analysis of variance that tries to minimize the within-cluster sum of squares; and average linkage, in which the distance between clusters is based on the average distance between all members.
31 Analytical Techniques. Hierarchical clustering: one of the main problems with this type of clustering is deciding on the number of clusters. The procedure results in the tree structure; the cut-off has to be decided afterwards, by inspection of the dendrogram, by qualitative judgment of the cluster result, or by statistics (e.g., silhouette values).
32 Analytical Techniques. Example of hierarchical clustering with Ward's method and Euclidean distance on institute research profiles. (Figure: dendrogram over the fields A, I, P, X, Z, B, M, N, G, E, S, O, R, C, H, U, with resulting clusters labelled BIO, AGR, MDS, GSS, TNS, CHE, GRM and SPM.)
33 Analytical Techniques. Example of hierarchical clustering with Ward's method and cosine distance (publications in the field Energy & Fuels). The silhouette value (Rousseeuw, 1987) compares the distance from one object to the other objects in the same cluster with its distance to all objects outside the cluster. Values close to 1 indicate a good clustering.
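A Ward clustering with a cut of the resulting tree can be sketched with SciPy; the six research-profile vectors below are hypothetical, chosen to form two obvious groups:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-dimensional research profiles for six institutions
profiles = np.array([
    [0.90, 0.10], [0.80, 0.20], [0.85, 0.15],   # dominated by field 1
    [0.10, 0.90], [0.20, 0.80], [0.15, 0.85],   # dominated by field 2
])

Z = linkage(profiles, method="ward")               # Ward's minimum-variance method
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram at 2 clusters
print(labels)
```

In practice the cut level t would be chosen by inspecting the dendrogram or by comparing silhouette values for several candidate cuts, as described above.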
34 Analytical Techniques. K-means clustering aims to partition the entities into k clusters such that each entity belongs to the cluster with the nearest mean. The procedure runs iteratively: it starts with k different means and assigns each observation to one of them; in a second step the k means are updated. The iteration stops when the total shift of the means becomes marginal. The original algorithm takes only vectors as input, not similarities, which means that this clustering technique can only be applied to a limited set of relations.
35 Analytical Techniques. Two features of k-means clustering have a crucial influence on the final solution: 1. The number of clusters has to be chosen prior to any analysis, based on available knowledge of the topic; sometimes a hierarchical clustering is used to get an indication. 2. Step 1 starts with k different means; these can be chosen randomly but can also be derived from a hierarchical clustering.
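The assign-then-update iteration described above can be sketched directly in NumPy (a minimal implementation with random initial means; the stopping rule here is simple convergence of the means):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: assign each point to the nearest mean, then
    update the means, until the means stop moving."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]  # random initial means
    for _ in range(n_iter):
        # distance of every point to every mean, shape (n_points, k)
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = d.argmin(axis=1)                          # assignment step
        new_means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_means, means):                  # means stabilised
            break
        means = new_means                                  # update step
    return labels, means

# Two well-separated illustrative groups of points
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, means = kmeans(X, k=2)
```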
36 Analytical Techniques. Tools for the representation of similarities. Multidimensional scaling: this methodology plots the matrix of similarities in an N-dimensional space. Several implementations are available (ALSCAL, PROXSCAL), but they are very sensitive to the total number of entities, as the complete similarity matrix has to be processed. The Kamada-Kawai algorithm is much faster in achieving convergence and can lay out much larger networks. It is implemented in Pajek, a program for network analysis and visualization.
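ALSCAL and PROXSCAL are specific (SPSS) implementations; the underlying idea can be illustrated with classical (Torgerson) MDS, which embeds entities from a distance matrix via double centring and an eigendecomposition. This is a sketch of the classical variant, not of either named implementation:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) multidimensional scaling: embed entities in
    k dimensions from a symmetric distance matrix D."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centred Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]         # k largest eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Distances between three collinear points at positions 0, 1 and 3
D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])
coords = classical_mds(D, k=2)
```

For distances that are exactly embeddable, as here, the pairwise distances between the recovered coordinates reproduce D.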
37 Analytical Techniques Kamada-kawai representation of 50 documents with hybrid links. This graph is made with Pajek.
38 Kamada-Kawai representation of 7 and 8 clusters in the field Energy & Fuels. The links between clusters are citation links.
More informationCreation of a highly detailed, dynamic, global model and map of science
Creation of a highly detailed, dynamic, global model and map of science Kevin W. Boyack * and Richard Klavans ** * kboyack@mapofscience.com SciTech Strategies, Inc., Albuquerque, NM 87122 (USA) ** rklavans@mapofscience.com
More informationRanked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2017. Instructor: Walid Magdy
Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-017 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 1 Boolean Retrieval Thus far,
More informationComparison between ontology distances (preliminary results)
Comparison between ontology distances (preliminary results) Jérôme David, Jérôme Euzenat ISWC 2008 1 / 17 Context A distributed and heterogenous environment: the semantic Web Several ontologies on the
More informationMachine Learning for natural language processing
Machine Learning for natural language processing Classification: k nearest neighbors Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Summer 2016 1 / 28 Introduction Classification = supervised method
More informationLarge-scale Image Annotation by Efficient and Robust Kernel Metric Learning
Large-scale Image Annotation by Efficient and Robust Kernel Metric Learning Supplementary Material Zheyun Feng Rong Jin Anil Jain Department of Computer Science and Engineering, Michigan State University,
More informationStatistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1
Week 2 Based in part on slides from textbook, slides of Susan Holmes October 3, 2012 1 / 1 Part I Other datatypes, preprocessing 2 / 1 Other datatypes Document data You might start with a collection of
More informationPart I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes
Week 2 Based in part on slides from textbook, slides of Susan Holmes Part I Other datatypes, preprocessing October 3, 2012 1 / 1 2 / 1 Other datatypes Other datatypes Document data You might start with
More informationSimilarity and Dissimilarity
1//015 Similarity and Dissimilarity COMP 465 Data Mining Similarity of Data Data Preprocessing Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed.
More informationUsing global mapping to create more accurate document-level maps of research fields
Using global mapping to create more accurate document-level maps of research fields Richard Klavans a & Kevin W. Boyack b a SciTech Strategies, Inc., Berwyn, PA 19312 USA (rklavans@mapofscience.com) b
More informationFall CS646: Information Retrieval. Lecture 6 Boolean Search and Vector Space Model. Jiepu Jiang University of Massachusetts Amherst 2016/09/26
Fall 2016 CS646: Information Retrieval Lecture 6 Boolean Search and Vector Space Model Jiepu Jiang University of Massachusetts Amherst 2016/09/26 Outline Today Boolean Retrieval Vector Space Model Latent
More informationEmbeddings Learned By Matrix Factorization
Embeddings Learned By Matrix Factorization Benjamin Roth; Folien von Hinrich Schütze Center for Information and Language Processing, LMU Munich Overview WordSpace limitations LinAlgebra review Input matrix
More informationSemantic Similarity and Relatedness
Semantic Relatedness Semantic Similarity and Relatedness (Based on Budanitsky, Hirst 2006 and Chapter 20 of Jurafsky/Martin 2 nd. Ed. - Most figures taken from either source.) Many applications require
More informationCS 3750 Advanced Machine Learning. Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya
CS 375 Advanced Machine Learning Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya Outline SVD and LSI Kleinberg s Algorithm PageRank Algorithm Vector Space Model Vector space model represents
More informationPreprocessing & dimensionality reduction
Introduction to Data Mining Preprocessing & dimensionality reduction CPSC/AMTH 445a/545a Guy Wolf guy.wolf@yale.edu Yale University Fall 2016 CPSC 445 (Guy Wolf) Dimensionality reduction Yale - Fall 2016
More informationData Mining Techniques
Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 12 Jan-Willem van de Meent (credit: Yijun Zhao, Percy Liang) DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Linear Dimensionality
More informationHow rural the EU RDP is? An analysis through spatial funds allocation
How rural the EU RDP is? An analysis through spatial funds allocation Beatrice Camaioni, Roberto Esposti, Antonello Lobianco, Francesco Pagliacci, Franco Sotte Department of Economics and Social Sciences
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 9: Dimension Reduction/Word2vec Cho-Jui Hsieh UC Davis May 15, 2018 Principal Component Analysis Principal Component Analysis (PCA) Data
More informationApplied Natural Language Processing
Applied Natural Language Processing Info 256 Lecture 9: Lexical semantics (Feb 19, 2019) David Bamman, UC Berkeley Lexical semantics You shall know a word by the company it keeps [Firth 1957] Harris 1954
More informationCS47300: Web Information Search and Management
CS47300: Web Information Search and Management Prof. Chris Clifton 6 September 2017 Material adapted from course created by Dr. Luo Si, now leading Alibaba research group 1 Vector Space Model Disadvantages:
More informationCS246 Final Exam, Winter 2011
CS246 Final Exam, Winter 2011 1. Your name and student ID. Name:... Student ID:... 2. I agree to comply with Stanford Honor Code. Signature:... 3. There should be 17 numbered pages in this exam (including
More informationMultivariate Statistics 101. Ordination (PCA, NMDS, CA) Cluster Analysis (UPGMA, Ward s) Canonical Correspondence Analysis
Multivariate Statistics 101 Ordination (PCA, NMDS, CA) Cluster Analysis (UPGMA, Ward s) Canonical Correspondence Analysis Multivariate Statistics 101 Copy of slides and exercises PAST software download
More informationInformation Retrieval and Web Search
Information Retrieval and Web Search IR models: Vector Space Model IR Models Set Theoretic Classic Models Fuzzy Extended Boolean U s e r T a s k Retrieval: Adhoc Filtering Brosing boolean vector probabilistic
More informationIntelligent Data Analysis Lecture Notes on Document Mining
Intelligent Data Analysis Lecture Notes on Document Mining Peter Tiňo Representing Textual Documents as Vectors Our next topic will take us to seemingly very different data spaces - those of textual documents.
More information