Social Data Mining Trainer: Enrico De Santis, PhD


1 Social Data Mining Trainer: Enrico De Santis, PhD CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION

2 Outline Vector Semantics From plain text to mathematical representations Linear algebra in a nutshell Assessing the semantic content of text: the term-context matrix and the term-document matrix Similarity and dissimilarity computation The cosine similarity family Applications

3 Vector Semantics One of the biggest obstacles to making full use of the power of computers is that they currently understand very little of the meaning of human language. The term Semantics is used here in a general sense, as the meaning of a word, a phrase, a sentence, or any text in human language, and the study of such meaning. We are not concerned with narrower senses of semantics, such as the semantic web or approaches to semantics based on formal logic. 3

4 Vector Semantics What is Vector Semantics? To understand vector semantics and to apply the related techniques to social network analysis and text analysis, we need to be ferried from the world of the humanities to the world of mathematics... [Image: painting by Joachim Patinir, Museo Nacional del Prado, public domain]

5 Vector Semantics For any type of text analysis, it is of paramount importance to represent the documents of a corpus in a mathematical space. We will see how to embed a set of documents or words in a vector space, so that its algebraic properties can be exploited: to measure the similarity between words and grasp their semantics; to measure the similarity between documents; to extract a set of features and build a logical structure on which advanced analysis techniques such as supervised and unsupervised learning (machine learning) can be applied.

6 Before we start Words and documents will be suitably embedded in a vector space. Before we start, it is useful to recall some notions that will be used, sometimes explicitly and sometimes implicitly, in all of the following analyses. We will recall: the notion of vector space and of vectors; the basic manipulations of vectors; the notion of matrix and its basic manipulations; how to calculate the distance between two vectors.

7 Background: space, vectors, matrices A vector space (also called a linear space) is a collection of objects called vectors, which may be added together and multiplied ("scaled") by numbers, called scalars in this context. Graphically, vectors are represented as arrows, but they are regarded as abstract mathematical objects with particular properties. Geometrically, vectors represent points in a given space. Vector addition and scalar multiplication: a vector v (blue) is added to another vector w (red); w is stretched by a factor of 2, yielding the sum v + 2w.

8 Background: space, vectors, matrices Mathematically, the two-dimensional surface of a table or the three-dimensional space in which our body moves are examples of vector spaces. A vector p is expressed by its components, p = [a_1, a_2, a_3], which are its coordinates in a Cartesian space; here the dimension is 3 (the space is R^3). In an n-dimensional space, given n linearly independent vectors, every other vector can be expressed mathematically as a (linear) combination of these n vectors.

9 Background: linear combination If v_1, v_2, ..., v_n are vectors and a_1, ..., a_n are scalars, then the linear combination of those vectors with those scalars as coefficients is: w = a_1 v_1 + a_2 v_2 + ... + a_n v_n. Example: consider the vectors e_1 = (1,0,0), e_2 = (0,1,0) and e_3 = (0,0,1). Then any vector in R^3 is a linear combination of e_1, e_2 and e_3. To see that, take an arbitrary vector x = [a_1, a_2, a_3] in R^3 and write: x = [a_1, a_2, a_3] = (a_1, 0, 0) + (0, a_2, 0) + (0, 0, a_3) = a_1 (1,0,0) + a_2 (0,1,0) + a_3 (0,0,1) = a_1 e_1 + a_2 e_2 + a_3 e_3. We use only vector addition and multiplication by a scalar!
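The following is a minimal NumPy sketch (not from the original slides; the array values are illustrative) of the operations just described: vector addition, scalar multiplication, and the decomposition of an arbitrary vector of R^3 onto the standard basis e_1, e_2, e_3.

```python
import numpy as np

# Vector addition and scalar multiplication (slide 7): v + 2w
v = np.array([2.0, 1.0])
w = np.array([1.0, 3.0])
print(v + 2 * w)                       # [4. 7.]

# Linear combination on the standard basis of R^3 (slide 9)
e = np.eye(3)                          # rows are e1, e2, e3
x = np.array([5.0, -2.0, 7.0])         # an arbitrary vector [a1, a2, a3]
reconstruction = x[0] * e[0] + x[1] * e[1] + x[2] * e[2]
print(np.allclose(x, reconstruction))  # True: x = a1*e1 + a2*e2 + a3*e3
```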

10 Background: linear dependence A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others; if no vector in the set can be written in this way, then the vectors are said to be linearly independent. 10

11 Background: scalar product between two vectors Given two vectors v and w (in any dimension), the scalar (dot) product is defined as: v · w = v_1 w_1 + v_2 w_2 + ... + v_n w_n = a scalar value (a number). If v is orthogonal to w then v · w = 0 (the vectors form an angle of 90 degrees). Through the scalar product it is possible to define the length of a vector w, also called the (Euclidean) norm ||w||: ||w|| = length = sqrt(w · w) = sqrt(w_1 w_1 + w_2 w_2 + ... + w_n w_n). The scalar product also defines the angle between two vectors: cos(angle(v, w)) = (v · w) / (||v|| ||w||).

12 Background: Euclidean distance Given two vectors v and w (in any dimension), the Euclidean distance d(w, v) is defined as: d(w, v) = ||w - v|| = sqrt((w - v) · (w - v)) = sqrt((v_1 - w_1)^2 + (v_2 - w_2)^2 + ... + (v_n - w_n)^2) = sqrt(sum_{i=1..n} (v_i - w_i)^2). There are many possible definitions of distance in machine learning; one example is the weighted Euclidean distance: given a set of weights a_1, a_2, ..., a_n, d(w, v; a) = sqrt(sum_{i=1..n} a_i (v_i - w_i)^2).
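A short NumPy sketch of the two distances defined above; the vectors and weights are illustrative.

```python
import numpy as np

def euclidean(v, w):
    """Euclidean distance: sqrt(sum_i (v_i - w_i)^2)."""
    return np.sqrt(np.sum((v - w) ** 2))

def weighted_euclidean(v, w, a):
    """Weighted Euclidean distance with non-negative weights a_i."""
    return np.sqrt(np.sum(a * (v - w) ** 2))

v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 0.0, 3.0])
print(euclidean(v, w))                        # sqrt(9 + 4 + 0) ~ 3.606
print(np.linalg.norm(v - w))                  # same result via the norm
print(weighted_euclidean(v, w, np.array([0.5, 2.0, 1.0])))
```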

13 Background: matrices A matrix A = [a_ij] is a table of values, or equivalently a collection of row vectors or of column vectors. [Figure: the row space and the column space of a matrix, and matrix multiplication.]

14 Vector Space Model and information retrieval The VSM was developed for the SMART information retrieval system (Salton, 1971) by Gerard Salton and his colleagues (Salton, Wong, & Yang, 1975). The idea of the VSM is to represent each document in a collection as a point in a space (a vector in a vector space). Points that are close together in this space are semantically similar and points that are far apart are semantically distant. The user's query is represented as a point in the same space as the documents (the query is a pseudo-document).

15 Distributional models of meaning = vector-space models of meaning = vector semantics Intuitions: Zellig Harris (1954): "oculist and eye-doctor occur in almost the same environments... If A and B have almost identical environments we say that they are synonyms." Firth (1957): "You shall know a word by the company it keeps!"

16 Vector semantics Nida's example: "A bottle of tesgüino is on the table." "Everybody likes tesgüino." "Tesgüino makes you drunk." "We make tesgüino out of corn." From the context words, humans can guess that tesgüino means an alcoholic beverage like beer. Intuition for the algorithm: two words are similar if they have similar word contexts.

17 Therefore, a working hypothesis, the statistical semantics hypothesis: statistical patterns of human word usage can be used to figure out what people mean (George Furnas, University of Michigan). Similarity of words: the Word-Context matrix. Similarity of documents: the Term-Document matrix.

18 The Word-Context Matrix Wittgenstein was primarily interested in the physical activities that form the context of word usage (e.g., the word brick, spoken in the context of the physical activity of building a house). The distributional hypothesis in linguistics is that words that occur in similar contexts tend to have similar meanings (Harris, 1954). A word may be represented by a vector in which the elements are derived from the occurrences of the word in various contexts, such as windows of words (Lund & Burgess, 1996). Richer contexts can also be used, such as grammatical dependencies (Lin, 1998; Pado & Lapata, 2007) or dependency graphs between words.

19 The Word-Context Matrix The Word-Context matrix, also known as the Term-Term matrix, is a matrix in which both rows and columns are labeled by words. Denoting by |V| the number of unique words in a corpus or document (i.e., the types), the matrix has dimension |V| x |V|; each cell records the number of times the row (target) word and the column (context) word co-occur in some context in some training corpus. Usually the context is a window around the word, for example 7 words to the left and 7 words to the right, in which case the cell represents the number of times (in some training corpus) the column word occurs in such a ±7-word window around the row word.
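As a rough illustration of how such a matrix can be built, here is a minimal Python sketch over a toy token list; the tiny corpus and the window size of 2 are illustrative assumptions, not the ±7-word window of the slide.

```python
import numpy as np

def word_context_matrix(tokens, window=2):
    """Build a |V| x |V| co-occurrence matrix from a list of tokens.

    Cell (i, j) counts how often context word j appears within
    `window` positions to the left or right of target word i.
    """
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)), dtype=int)
    for pos, target in enumerate(tokens):
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                M[index[target], index[tokens[ctx_pos]]] += 1
    return M, vocab

# Toy corpus (illustrative only)
tokens = "we make tesguino out of corn everybody likes tesguino".split()
M, vocab = word_context_matrix(tokens, window=2)
print(vocab)
print(M)
```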

20 The Word-Context Matrix Example from the Brown Corpus: [Figure: a ±7-word context window around a target word, and a sample of the Word-Context matrix (the real matrix is much larger), built from the raw frequencies of co-occurrence of two words.] The graph is a spatial visualization of the word vectors for digital and information, showing just two of the dimensions, corresponding to the words data and result. Note that |V|, the length of the vector, is generally the size of the vocabulary, usually between 10,000 and 50,000 words. Most of these numbers are zero, hence the matrix is called sparse.

21 The Word-Context Matrix The size of the window used to collect counts can vary based on the goals of the representation, but is generally between 1 and 8 words on each side of the target word (for a total context of 3-17 words). In general, the shorter the window, the more syntactic the representations, since the information is coming from immediately nearby words; the longer the window, the more semantic the relations. Two words have first-order co-occurrence (sometimes called syntagmatic association) if they are typically nearby each other. Thus wrote is a first-order associate of book or poem. Two words have second-order co-occurrence (sometimes called paradigmatic association) if they have similar neighbors. Thus wrote is a second-order associate of words like said or remarked.

22 The W-C Matrix: Measuring the association between words It turns out, however, that a simple frequency count isn't the best measure of association between words. If we want to know what kinds of contexts are shared by apricot and pineapple but not by digital and information, we're not going to get good discrimination from words like the, it, or they, which occur frequently with all sorts of words and aren't informative about any particular word. Instead we'd like context words that are particularly informative about the target word. The best weighting or measure of association between words should tell us how much more often than chance the two words co-occur.

23 Mutual Information and Pointwise Mutual Information (PMI) Pointwise mutual information is just such a measure (Church and Hanks, 1989; Church and Hanks, 1990). The mutual information between two random variables X and Y is: I(X, Y) = sum_x sum_y P(x, y) log2( P(x, y) / (P(x) P(y)) ). The pointwise mutual information (Fano, 1961) is a measure of how often two events x and y occur together, compared with what we would expect if they were independent: I(x, y) = log2( P(x, y) / (P(x) P(y)) ).

24 The Positive Pointwise Mutual Information (PPMI) We can apply this intuition to co-occurrence vectors by defining the pointwise mutual information association between a target word w and a context word c as: PMI(w, c) = log2( P(w, c) / (P(w) P(c)) ). The numerator tells us how often we observed the two words together. The denominator tells us how often we would expect the two words to co-occur if they each occurred independently (so their probabilities could just be multiplied). Thus, the ratio gives us an estimate of how much more the target and feature co-occur than we would expect by chance.

25 The Positive Pointwise Mutual Information (PPMI) PMI values range from negative to positive infinity. Negative PMI values (which imply things are co-occurring less often than we would expect by chance) tend to be unreliable unless our corpora are enormous. Furthermore, it's not clear whether it's even possible to evaluate such scores of unrelatedness with human judgments. It is common to use Positive PMI (called PPMI), which replaces all negative PMI values with zero: PPMI(w, c) = max( log2( P(w, c) / (P(w) P(c)) ), 0 ).

26 PPMI, an example Let's assume we have a co-occurrence matrix F with W rows (words) and C columns (contexts), where f_ij gives the number of times word w_i occurs in context c_j. [Figure: the W x C co-occurrence matrix F.]

27 PPMI, an example Thus, for example, we could compute PPMI(w = information, c = data), assuming we pretend that the figure below encompasses all the relevant word contexts/dimensions. [Figure: the count table, the corresponding joint probabilities with row and column marginals, and the resulting PPMI values.]
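A minimal NumPy sketch of the PPMI computation described above; the count matrix F below is illustrative and is not the table from the slide's figure.

```python
import numpy as np

def ppmi(F):
    """PPMI matrix from a word-by-context count matrix F.

    p_ij = f_ij / total; PPMI_ij = max(log2(p_ij / (P(w_i) P(c_j))), 0).
    Cells with zero counts get a PPMI of 0.
    """
    F = np.asarray(F, dtype=float)
    P = F / F.sum()
    p_w = P.sum(axis=1, keepdims=True)      # row marginals P(w)
    p_c = P.sum(axis=0, keepdims=True)      # column marginals P(c)
    expected = p_w * p_c
    pmi = np.zeros_like(P)
    mask = P > 0
    pmi[mask] = np.log2(P[mask] / expected[mask])
    return np.maximum(pmi, 0.0)

# Illustrative word-by-context counts (rows = words, columns = contexts)
F = np.array([[0, 1, 6],
              [2, 0, 0],
              [1, 2, 0]])
print(np.round(ppmi(F), 2))
```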

28 The Term-Document Matrix In a Term-Document matrix or Word-Document matrix, each row represents a word in the vocabulary and each column represents a document from some collection. Each cell in this matrix represents the number of times a particular word (defined by the row) occurs in a particular document (defined by the column). 28

29 The Term-Document Matrix Used for document indexing (e.g., search engines): retrieval is based on the similarity between documents, and similarity is based on the occurrence frequencies of keywords in the query and in the document. Basic hypothesis: Bag of Words, word order does not matter; documents and queries are both vectors. Each term i in a document or query j is given a real-valued weight w_ij. Both documents and queries are expressed as |V|-dimensional vectors: d_j = (w_1j, w_2j, ..., w_|V|j).

30 The Term-Document Matrix Each cell holds the count of term t in document d: tf_{t,d}. Each document is a count vector in N^|V|, i.e., a column of the matrix. We can think of the vector for a document as identifying a point in |V|-dimensional space. [Table: counts of the terms battle, soldier, fool and clown (rows) in the plays As You Like It, Twelfth Night, Julius Caesar and Henry V (columns).]
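As an illustration, a small sketch that builds a term-document count matrix with scikit-learn's CountVectorizer (assuming a recent scikit-learn is installed); the four mini-documents stand in for the plays and are not the slide's data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative mini-corpus (stand-ins for the four plays)
docs = ["the battle and the soldier",
        "the fool and the clown",
        "battle soldier battle",
        "fool clown fool soldier"]

vec = CountVectorizer()
X = vec.fit_transform(docs)             # documents x terms (sparse counts)
td = X.T.toarray()                      # term-document matrix: terms x documents
for term, row in zip(vec.get_feature_names_out(), td):
    print(f"{term:10s} {row}")
```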

31 The Term-Document Matrix [Figure: the general form of a term-document matrix, with |V| = the dimension of the vocabulary.] This is the general form of a term-document matrix; instead of the raw frequency of the words in the document we may also use functions of this frequency, such as the log frequency or other weighting schemes.

32 Term-Document Matrix: weighting More frequent terms in a document are more important, i.e., more indicative of the topic. f_ij = frequency of term i in document j. We may want to normalize the term frequency (tf) by dividing by the frequency of the most common term in the document: tf_ij = f_ij / max_i f_ij. Terms that appear in many different documents are less indicative of the overall topic. df_i = document frequency of term i = number of documents containing term i; idf_i = log2(N / df_i) = inverse document frequency of term i (N: total number of documents). A typical combined term-importance indicator is tf-idf weighting: w_ij = tf_ij * idf_i = tf_ij * log2(N / df_i). A term occurring frequently in the document but rarely in the rest of the collection is given a high weight (other weighting approaches can also be used).
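A minimal NumPy sketch of this tf-idf scheme (max-normalized tf and log2 idf, as in the formulas above); the count matrix is illustrative.

```python
import numpy as np

def tf_idf(F):
    """tf-idf weights for a term-document count matrix F (terms x documents).

    tf_ij = f_ij / max_i f_ij   (normalize by the most frequent term in doc j)
    idf_i = log2(N / df_i)      (N documents, df_i = docs containing term i)
    w_ij  = tf_ij * idf_i
    """
    F = np.asarray(F, dtype=float)
    tf = F / F.max(axis=0, keepdims=True)
    N = F.shape[1]
    df = (F > 0).sum(axis=1)
    idf = np.log2(N / df)
    return tf * idf[:, np.newaxis]

# Illustrative counts: 4 terms x 3 documents
F = np.array([[3, 0, 1],
              [0, 2, 0],
              [1, 1, 1],
              [5, 0, 0]])
print(np.round(tf_idf(F), 3))
```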

33 T-D Matrix: Measuring Similarity Dual representation: (1) documents in term space, or (2) terms in document space: (1) d_i = w_1i t_1 + w_2i t_2 + ... + w_ni t_n; (2) t_s = w_1s d_1 + w_2s d_2 + ... + w_ms d_m. Cosine similarity between two documents: sim(d_j, d_k) = (d_j · d_k) / (||d_j|| ||d_k||) = sum_{i=1..n} w_ij w_ik / ( sqrt(sum_{i=1..n} w_ij^2) sqrt(sum_{i=1..n} w_ik^2) ). Representation (2) is useful for measuring the semantic similarity of terms given a corpus. A query document q can be compared with each column of the term-document matrix X and the results can be used for ranking (document indexing).
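A short sketch of query ranking by cosine similarity against the columns of a term-document matrix; the matrix and query values are illustrative.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def rank_documents(X, q):
    """Rank the columns (documents) of a term-document matrix X
    by cosine similarity to a query vector q in the same term space."""
    scores = np.array([cosine(X[:, j], q) for j in range(X.shape[1])])
    order = np.argsort(scores)[::-1]        # best-matching documents first
    return order, scores

# Illustrative term-document matrix (4 terms x 3 documents) and query
X = np.array([[3., 0., 1.],
              [0., 2., 0.],
              [1., 1., 1.],
              [5., 0., 0.]])
q = np.array([1., 0., 1., 2.])              # the query as a pseudo-document
order, scores = rank_documents(X, q)
print(order, np.round(scores, 3))
```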

34 Dot Product and cosine similarity Most metrics for similarity between vectors are based on the dot product. The dot product acts as a similarity metric because it tends to be high exactly when the two vectors have large values in the same dimensions. v · w = sum_{i=1..n} v_i w_i = v_1 w_1 + v_2 w_2 + ... + v_n w_n. Recall the vector length: ||v|| = sqrt(sum_i v_i^2). The dot product is higher if a vector is longer, with higher values in each dimension. More frequent words have longer vectors, since they tend to co-occur with more words and have higher co-occurrence values with each of them. The raw dot product will thus be higher for frequent words (a problem). Solution: the simplest way to normalize the dot product for vector length is to divide it by the lengths of the two vectors, or to pre-normalize each vector to unit length: cosine(v, w) = (v · w) / (||v|| ||w||).
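A small sketch showing the frequency bias of the raw dot product and how the cosine removes it; the co-occurrence vectors are made up for illustration.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up co-occurrence vectors over the same 4 context dimensions
query    = np.array([2., 0., 3., 1.])
frequent = np.array([20., 5., 25., 10.])   # a very frequent word: a long vector
rare     = np.array([2., 0., 3., 1.])      # a rare word with the same profile as the query

print("dot:", query @ frequent, query @ rare)            # raw dot product favors the frequent word
print("cos:", round(cosine(query, frequent), 3),
              round(cosine(query, rare), 3))              # cosine prefers the matching profile
```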

35 How about the correlation metric? Hence the cosine similarity coincides with the dot product when the involved vectors are normalized to unit length. Both similarity measures are based on the dot product. How about correlation? Cosine similarity is not invariant to shifts: if x were shifted to x+1, the cosine similarity would change. What is invariant, though, is the Pearson correlation. If w̄, v̄ are the respective mean values of w, v: Corr(w, v) = sum_i (v_i - v̄)(w_i - w̄) / ( sqrt(sum_i (v_i - v̄)^2) sqrt(sum_i (w_i - w̄)^2) ) = ((v - v̄) · (w - w̄)) / (||v - v̄|| ||w - w̄||) = cosine(w - w̄, v - v̄).
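A quick numerical check (with illustrative vectors) that the Pearson correlation equals the cosine of the mean-centered vectors and is invariant to shifts.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

v = np.array([1., 4., 2., 7., 5.])
w = np.array([2., 5., 1., 9., 4.])

corr_as_cosine = cosine(v - v.mean(), w - w.mean())    # cosine of mean-centered vectors
pearson = np.corrcoef(v, w)[0, 1]                      # NumPy's Pearson correlation
print(round(corr_as_cosine, 6), round(pearson, 6))     # identical values

# Shift-invariance: correlation is unchanged by adding a constant, cosine is not
print(round(cosine(v, w), 6), round(cosine(v + 10, w), 6))
print(round(np.corrcoef(v + 10, w)[0, 1], 6))
```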

36 Correlation metrics: an application Hierarchical clustering: clustering word vectors is a way to visualize which words are most similar to one another (Rohde et al., 2006). [Figure: hierarchical clustering used to visualize 4 noun classes from the embeddings produced by Rohde et al. (2006). These embeddings use a window size of 4 and 14,000 dimensions, with 157 closed-class words removed. The visualization uses hierarchical clustering with correlation as the similarity function.]
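A minimal sketch of hierarchical clustering with correlation as the (dis)similarity, assuming SciPy is available; the word vectors are made up and far smaller than Rohde et al.'s embeddings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Made-up co-occurrence vectors for a few words (one row per word)
words = ["dog", "cat", "car", "truck"]
X = np.array([[8., 7., 1., 0.],
              [9., 6., 0., 1.],
              [0., 1., 9., 8.],
              [1., 0., 7., 9.]])

# Correlation distance = 1 - Pearson correlation between word vectors
dist = pdist(X, metric="correlation")
Z = linkage(dist, method="average")   # agglomerative (hierarchical) clustering
print(Z)  # pass Z to scipy.cluster.hierarchy.dendrogram to draw the tree
```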

37 Cosine similarity: an example Depending on the application, the cosine similarity can be used on the Term-Document matrix or on the Term-Term matrix (it can be used in any vector space). Let's see how the cosine decides which of the words apricot or digital is closer in meaning to information, using just the raw counts from a simplified co-occurrence table. The model decides that information is closer to digital than it is to apricot, a result that seems sensible. Recall: for small angles the cosine measure is higher, i.e., the vectors are more similar.

38 Applications The Term-Document matrix is arranged with word types as rows and documents as columns; it is useful for assessing the meaning of words. Depending on the application, it can be fruitful to consider the transpose of this matrix, i.e., the Document-Term matrix, which has documents as rows and word types as columns. The distinction depends on the task we would like to accomplish. For example, in machine learning (ML) and data analysis we may want to classify documents (e.g., in Sentiment Analysis). Considering the Document-Term matrix as a dataset, word types are treated as variables (features, in ML jargon) and documents as measurements (patterns, in ML and Pattern Recognition jargon).
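A minimal sketch of the Document-Term-matrix-as-dataset idea, assuming scikit-learn is available; the texts and sentiment labels are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Illustrative mini-dataset: documents (patterns) with sentiment labels
texts = ["great movie, loved it", "terrible plot, boring",
         "wonderful acting", "awful and dull", "loved the soundtrack"]
labels = [1, 0, 1, 0, 1]                  # 1 = positive, 0 = negative

vec = CountVectorizer()
X = vec.fit_transform(texts)              # Document-Term matrix: rows = documents, columns = word types
clf = LogisticRegression().fit(X, labels) # supervised learning on the word-type features

# Should predict the negative class (0) on this toy data
print(clf.predict(vec.transform(["boring and awful movie"])))
```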

39 Conclusions Text mining applications are heavily based on vector semantics. Vector semantics relies on vector space models, which are in charge of transforming plain text into a suitable mathematical (vector) space, letting us use the power of linear algebra to measure the similarity or dissimilarity of objects. As we will see, in machine learning the availability of vectorial data is of paramount importance, because it provides features describing objects on which a suitable learning procedure can be run by a computer program. For example, the term-document matrix is an important building block in Social Data Mining, specifically in text analysis, and it enables applications such as thematic analysis of contents, text summarization, sentiment analysis and opinion mining, and so on. However, we can imagine a similar matrix structure consisting of features other than words and documents: the features can be any interesting information that we want to use to represent objects extracted from Social Media platforms (age, gender, geolocation, hardware used, ...).
