What is text mining? Discovering useful patterns or content in large amounts of data, which may be structured or unstructured.

Godfrey Moses Flowers
5 years ago

2 What is text mining? Discovering useful patterns or content in large amounts of data, which may be structured or unstructured.

3 Text mining: What can be used for text mining? Classification/categorization, clustering, summarization, and retrieval.

4 Pre-processing of text
Tokenisation: Separating the text into tokens and removing special symbols that are not required.
Stemming: Converting words such as "playing" and "played" into "play".
Lemmatisation: Returning the base form of the word, e.g. heard → hear.
Case folding: Converting upper case to lower case.
Stop word removal: Words such as "the", "an", and "on" are called stop words. They are of limited use in determining the weight of a document for retrieval.
Normalisation: Equivalence classing, such as the use of synonyms. Spell checking can also be performed here.
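The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration, not a production pipeline: the stop-word list and the `naive_stem` suffix stripper are assumptions made for the example, and a real system would normally use an established stemmer or lemmatiser instead.

```python
import re

# Assumption: a tiny illustrative stop-word list, not a standard one.
STOP_WORDS = {"the", "an", "on", "a", "is", "of"}

def naive_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Case folding, then tokenisation that drops punctuation and symbols.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Stop word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Stemming.
    return [naive_stem(t) for t in tokens]

print(preprocess("Sam played on the field; playing is fun!"))
```

Lemmatisation (heard → hear) needs a dictionary of word forms, so it is left out of this sketch.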

5 Different models for representation
Term frequency and weighting.
Bag of words: the number of occurrences of each word, with the exact ordering ignored.
Vector space model, and so on.
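As a small illustration of the bag-of-words idea, the sketch below counts words with Python's `collections.Counter`. The function name `bag_of_words` is chosen for this example, and punctuation handling is deliberately left out.

```python
from collections import Counter

def bag_of_words(text):
    # Lower-case and split on whitespace; punctuation handling is omitted for brevity.
    return Counter(text.lower().split())

first = bag_of_words("Sam likes cricket")
second = bag_of_words("cricket likes Sam")

# Word order differs, but the two bags are identical.
print(first == second)
```

Because only the counts are kept, two texts with the same words in a different order are indistinguishable in this representation.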

6 Term frequency
Term frequency is the number of times a term occurs in the document. E.g.: "Cricket is a game. Sam likes the game of cricket."

Terms       Cricket  is    a     game  Sam   likes  the   of
Freq        2        1     1     2     1     1      1     1
Normalised  2/10     1/10  1/10  2/10  1/10  1/10   1/10  1/10

Documents vary in size, so the raw frequency of a term depends on the length of the document, which affects the smaller documents. The frequency is therefore normalised by the total number of terms in the document.
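The normalised term frequency for the example sentence can be reproduced with a few lines of Python. This is a sketch under the assumption that tokens are lower-cased alphabetic runs; `normalised_tf` is a name chosen for this example.

```python
import re
from collections import Counter

def normalised_tf(text):
    # Assumed tokenisation: lower-case, alphabetic runs only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    # Divide each raw count by the number of tokens in the document.
    return {term: count / total for term, count in counts.items()}

tf = normalised_tf("Cricket is a game. Sam likes the game of cricket.")
for term, value in tf.items():
    print(term, value)
```

The sentence has 10 tokens, so "cricket" and "game" each get 2/10 and every other term gets 1/10.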

7 Inverse document frequency
The whole intention of generating terms is to find the documents relevant to a specific document or to a query that is fired. A term occurring many times does not by itself indicate the power to determine relevance: terms that appear in many documents discriminate poorly, so their weight needs to be scaled down. We use idf:
$\mathrm{idf}_t = 1 + \log_e \frac{N}{\mathrm{df}_t}$
where $t$ is the term, $N$ = total number of documents, and $\mathrm{df}_t$ = number of documents containing the term $t$.

8 So, for 3 documents:
D1 = Cricket is a game. Sam likes the game of cricket.
D2 = Do you play cricket?
D3 = Playing any game is good for health. I play basketball.
For D1, the idf values:

Term           Cricket   is        a         game      Sam       likes     the       of
Tf             2         1         1         2         1         1         1         1
Normalised tf  2/10      1/10      1/10      2/10      1/10      1/10      1/10      1/10
idf            log(3/2)  log(3/2)  log(3/1)  log(3/2)  log(3/1)  log(3/1)  log(3/1)  log(3/1)

The idf row lists the logarithmic term log(N / df_t) for each term.
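The idf values for the three documents can be checked with a short script. This sketch applies the formula from slide 7, idf = 1 + ln(N / df), so each printed value is 1 greater than the corresponding logarithm in the table. The tokeniser is an assumption (lower-cased alphabetic runs, no stemming, so "Playing" and "play" count as different terms).

```python
import math
import re

documents = [
    "Cricket is a game. Sam likes the game of cricket.",        # D1
    "Do you play cricket?",                                      # D2
    "Playing any game is good for health. I play basketball.",  # D3
]

def tokenize(text):
    # Assumed tokenisation: lower-case, alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def idf(term, docs):
    """idf_t = 1 + ln(N / df_t); the term must occur in at least one document."""
    n = len(docs)
    df = sum(1 for doc in docs if term in tokenize(doc))
    return 1 + math.log(n / df)

for term in ["cricket", "is", "a", "game", "sam", "likes", "the", "of"]:
    print(f"{term:8s} {idf(term, documents):.4f}")
```

Terms found in two of the three documents ("cricket", "is", "game") receive a lower idf than terms found only in D1.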

9 Tf-idf
To find relevant documents, a combined weighting approach called tf-idf is generally used. So:
$w_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$
Representing a set of documents as vectors in a common vector space is known as the vector space model.
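A document's tf-idf vector can be built by multiplying each term's tf by its idf over a shared vocabulary. The sketch below assumes the tf and idf values have already been computed; the numbers and the three-term vocabulary are hypothetical and chosen only to show the mechanics.

```python
def tfidf_vector(tf, idf, vocabulary):
    """Return one weight per vocabulary term; terms absent from the document get 0."""
    return [tf.get(term, 0.0) * idf[term] for term in vocabulary]

# Hypothetical values for illustration only.
vocabulary = ["cricket", "game", "play"]
idf = {"cricket": 1.4, "game": 1.4, "play": 2.1}
tf_d1 = {"cricket": 0.2, "game": 0.2}

vector_d1 = tfidf_vector(tf_d1, idf, vocabulary)
print(vector_d1)
```

Using the same vocabulary order for every document is what places all the vectors in a common vector space.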

10 Calculating similarities between the documents
Cosine similarity is often used. We are interested in how close to orthogonal two document vectors are, that is, in the angle between them.

11 More about the dot product: When we take the dot product of two vectors, say a · b, we are projecting a onto b. The angle between the vectors determines their orthogonality. If it is 90 degrees, the vectors are orthogonal and their dot product is zero.

12 Cosine similarity
$\text{CosineSimilarity}(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}$
$d_1 \cdot d_2 = d_1[0]\, d_2[0] + d_1[1]\, d_2[1] + \dots + d_1[n]\, d_2[n]$
$\lVert d_1 \rVert = \sqrt{d_1[0]^2 + d_1[1]^2 + \dots + d_1[n]^2}$
$\lVert d_2 \rVert = \sqrt{d_2[0]^2 + d_2[1]^2 + \dots + d_2[n]^2}$
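The cosine similarity formula translates directly into code. This is a minimal sketch using plain Python lists; returning 0.0 for an all-zero vector is an assumption made here to avoid dividing by zero, not something the slides specify.

```python
import math

def cosine_similarity(d1, d2):
    # Dot product: sum of element-wise products.
    dot = sum(a * b for a, b in zip(d1, d2))
    # Vector lengths (Euclidean norms).
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an all-zero vector is treated as having no similarity
    return dot / (norm1 * norm2)

print(cosine_similarity([1, 0], [0, 1]))  # orthogonal vectors
print(cosine_similarity([1, 2], [2, 4]))  # vectors pointing in the same direction
```

Orthogonal vectors give a value of 0, and vectors pointing in the same direction give a value of 1 (up to floating-point rounding), regardless of their lengths.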

13 What is to be done and how?
Goal: determine which document from the given set matches, or is most similar to, the input query.
1. For every term present in a document, calculate its term frequency, that is, the number of times it occurs in that document. This is to be done for every document.
2. Calculate the idf for every term, considering the entire corpus of documents.

14 3. Now, given a query, calculate tf*idf for every document, but only for the words that appear in the query. So, if the query words are "game" and "cricket":

         Doc 1  Doc 2  Doc 3
game     val11  val21  val31
cricket  val12  val22  val32

4. In the same way, calculate the tf*idf score for the query. Since we assume the query is "game cricket", the tf of each word is 1, and the normalised tf is 0.5 for each. What about idf? The idf of the training data is used directly, so each tf*idf value is 0.5 * idf (obtained in step 2). Let tf*idf of game = Qval1 and tf*idf of cricket = Qval2.

15 Once we have the tf*idf values, every document is represented by a feature vector (FV) with just two terms.
Doc 1 = (val11, val12)
Doc 2 = (val21, val22)
Doc 3 = (val31, val32)
Query = (Qval1, Qval2)
Cosine similarity between Doc 1 and the query:
Step 1: dot product = (val11 * Qval1) + (val12 * Qval2)
Step 2: $\lVert \text{Query} \rVert = \sqrt{\text{Qval1}^2 + \text{Qval2}^2}$ and $\lVert \text{Doc 1} \rVert = \sqrt{\text{val11}^2 + \text{val12}^2}$
Cosine value = dot product / ($\lVert \text{Query} \rVert \times \lVert \text{Doc 1} \rVert$)
* In this way, calculate the similarity of every document with the query and determine the best match.
* A value of 1 is an exact match.
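Steps 1 to 4 and the cosine calculation can be put together in one short script. The sketch below follows the procedure on the three example documents with the query "game cricket". The function names, the simple tokeniser, and the choice of idf 0 for a query term missing from the corpus are assumptions made for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenisation: lower-case, alphabetic runs only, no stemming.
    return re.findall(r"[a-z]+", text.lower())

def normalised_tf(tokens):
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def score_documents(documents, query):
    doc_tokens = [tokenize(doc) for doc in documents]
    query_tokens = tokenize(query)
    query_terms = list(dict.fromkeys(query_tokens))  # unique terms, original order
    n = len(documents)

    # Step 2: idf from the corpus, restricted to the query terms.
    idf = {}
    for term in query_terms:
        df = sum(1 for tokens in doc_tokens if term in tokens)
        idf[term] = 1 + math.log(n / df) if df else 0.0  # assumption for unseen terms

    # Step 4: tf*idf vector for the query.
    query_tf = normalised_tf(query_tokens)
    query_vector = [query_tf[t] * idf[t] for t in query_terms]

    # Steps 1 and 3: tf*idf vector per document, then cosine with the query.
    scores = []
    for tokens in doc_tokens:
        tf = normalised_tf(tokens)
        doc_vector = [tf.get(t, 0.0) * idf[t] for t in query_terms]
        scores.append(cosine(query_vector, doc_vector))
    return scores

documents = [
    "Cricket is a game. Sam likes the game of cricket.",
    "Do you play cricket?",
    "Playing any game is good for health. I play basketball.",
]
scores = score_documents(documents, "game cricket")
for number, score in enumerate(scores, start=1):
    print(f"Doc {number}: {score:.4f}")
```

Because the vectors contain only the query terms, a document that uses both query words in equal proportion (Doc 1 here) scores 1 even though it contains other words as well.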

16 Problem:
D1: This is big classroom
D2: Classroom has many benches
D3: This is house
D4: The house has garden
D5: The house is big
Query: Big house
Calculate the cosine similarity and determine which document matches the input query.

Vector Space Scoring Introduction to Information Retrieval Informatics 141 / CS 121 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Querying Corpus-wide statistics