
Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University

Outline
Introduction: Data Mining
Part One: Text Mining
Part Two: Preprocessing Text Data
Part Three: Feature Selection for Text Data
Part Four: Text Mining Algorithms
Part Five: Software Tools for Text Mining

Introduction: Data Mining Also known as KDD (Knowledge Discovery from Databases). Data mining: knowledge mining from data, i.e., the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.

Knowledge Discovery (KDD) Process Data mining is the core of the knowledge discovery process. [Figure: the KDD pipeline — Databases → Data Cleaning and Data Integration → Data Warehouse → Selection of task-relevant data → Data Mining → Pattern Evaluation → Knowledge.]

Data Mining in Business Intelligence [Figure: layered view with increasing potential to support business decisions from bottom to top — Data Sources: paper, files, web documents, scientific experiments, database systems (DBA); Data Preprocessing/Integration, Data Warehouses (DBA); Data Exploration: statistical summary, querying, and reporting (Data Analyst); Data Mining: information discovery (Data Analyst); Data Presentation: visualization techniques (Business Analyst); Decision Making (End User).]

Data Mining Functionalities
- Class description
- Association analysis
- Classification and prediction
- Cluster analysis
- Outlier analysis
- Trend and evolution analysis

Part One: Text Mining The process of analyzing text to extract information that is useful for particular purposes. The information to be extracted is stated clearly and explicitly in the text. Automatically processing text data sets is a challenge for text mining research.

Text Mining
- Text Classification
- Text Clustering
- Text Summarization
- Text Retrieval

Challenges of Text Mining Research
- Exploring features of text documents: document representation, document similarity measurement, reduction of high dimensionality
- Development of new high-performance text mining algorithms that target the unique features of text documents

Part Two: Preprocessing Text Data Text data is unstructured, amorphous, and difficult to deal with algorithmically. Document representation:
- Vector space model: a bag of words
- Models with word sequence information, e.g., the generalized suffix tree

Basic Preprocessing Techniques
- Tokenization
- Stopword removal
- Stemming
- Word weighting: TF-IDF
- Normalization
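
A minimal sketch of these steps on a tiny made-up corpus; the stopword list is a small stand-in for a real one, and suffix stripping stands in for a proper stemmer:

```python
import math
import re
from collections import Counter

# A tiny made-up corpus; in practice these would be full documents.
docs = ["Young boys play basketball.",
        "All boys like to play basketball.",
        "Half of young boys play football."]

stopwords = {"a", "an", "the", "of", "to", "all", "and"}     # minimal stand-in list

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())             # tokenization
    tokens = [t for t in tokens if t not in stopwords]       # stopword removal
    return [t.rstrip("s") for t in tokens]                   # crude stand-in for stemming

tokenized = [preprocess(d) for d in docs]
N = len(tokenized)
df = Counter(t for doc in tokenized for t in set(doc))       # document frequency of each term

def tfidf_vector(tokens):
    tf = Counter(tokens)
    vec = {t: tf[t] * math.log(N / df[t]) for t in tf}       # TF * IDF weighting
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}             # length normalization

vectors = [tfidf_vector(doc) for doc in tokenized]
print(vectors[0])
```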

Alternative Document Representation A document model should preserve the sequential relationship between words in the document, since the accurate meaning of a sentence depends closely on word order. Word meanings are better than word forms for representing the topics of documents, so an ontology should be used.

Generalized Suffix Tree [Figure: generalized suffix tree whose nodes correspond to the word sequences in the table below, built from documents containing phrases such as "young boys play basketball".]

Node No. | Word Sequence              | Length of Word Seq. | Document Id Set | Number of Document Ids
1        | young boys play            | 3                   | 1, 2            | 2
2        | boys play                  | 2                   | 1, 2, 3         | 3
3        | play                       | 1                   | 1, 2, 3         | 3
4        | basketball                 | 1                   | 1, 3            | 2
5        | young boys play basketball | 4                   | 1               | 1
6        | boys play basketball       | 3                   | 1, 3            | 2
7        | play basketball            | 2                   | 1, 3            | 2

GST of the Original Documents [Figure: generalized suffix tree of the original documents, with node labels including word sequences such as "all boys play basketball", "boys like to play basketball", and "half of young boys play football", each annotated with the ids of the documents containing that sequence.]

Investigating Word Meaning with WordNet WordNet is an on-line lexical reference system hosted by Princeton University. English nouns, verbs, adjectives, and adverbs are organized into synonym sets, each representing one underlying lexical concept. Illustration of the concept of a lexical matrix: rows are word meanings M1, M2, M3, ..., Mm; columns are word forms F1, F2, ..., Fn; an entry E(i, j) indicates that word form Fj can express meaning Mi.

Preprocessing by Using the WordNet For each word form, retrieve multiple synsets (synonym sets) and hypernyms from WordNet to build meaning unions (MU) with synset links (SLs). For the word form "box": MU = {SS1 -> SS2, SS3 -> SS4}; SS1 = {box}, SS2 = {container}; SS3 = {box, loge}, SS4 = {compartment}. A document becomes d' = <MU1, MU2, MU3, ...>, which is treated as a string of word meanings.
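
As a rough illustration of retrieving synsets and hypernyms, the sketch below uses NLTK's WordNet interface (it assumes the WordNet corpus has been downloaded; the exact synsets returned depend on the WordNet version, and the pairing of synonym sets with hypernym sets only approximates the MU/SL construction above):

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

word_form = "box"
meaning_union = []
for synset in wn.synsets(word_form, pos=wn.NOUN):
    synonyms = {lemma.name() for lemma in synset.lemmas()}       # e.g. {'box'} or {'box', 'loge'}
    hypernyms = {lemma.name()
                 for hyper in synset.hypernyms()
                 for lemma in hyper.lemmas()}                     # e.g. {'container'} or {'compartment'}
    meaning_union.append((synonyms, hypernyms))                   # one synset link SS -> hypernym set

for synonyms, hypernyms in meaning_union[:4]:
    print(synonyms, "->", hypernyms)
```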

Part Three: Feature Selection for Text Data Feature selection removes irrelevant or redundant features. The selected feature subset retains sufficient, reliable information about the original data set while reducing the dimensionality of the database. Widely used methods:
- Supervised: Information Gain (IG), χ² statistic (CHI)
- Unsupervised: Document Frequency (DF), Term Strength (TS)

Attribute Selection Method: Information Gain Select the attribute with the highest information gain; this attribute minimizes the information needed to classify the tuples in the resulting partitions. Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|. The expected information (entropy) needed to classify a tuple in D is

Info(D) = − Σ_{i=1}^{m} p_i log₂(p_i)

Entropy represents the average amount of information needed to identify the class label of a tuple in D.
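
A direct computation of the entropy formula and of the resulting information gain for a split; the class counts used below are hypothetical, chosen only for illustration:

```python
import math

def entropy(class_counts):
    """Info(D) = -sum_i p_i * log2(p_i) over the class distribution of D."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

def information_gain(parent_counts, partitions):
    """Gain(A) = Info(D) - sum_j (|D_j|/|D|) * Info(D_j) for a split into partitions."""
    total = sum(parent_counts)
    info_a = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - info_a

# Hypothetical example: 9 positive / 5 negative tuples, split into three partitions.
print(entropy([9, 5]))                                        # ~0.940
print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))     # ~0.247
```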

Supervised Feature Selection Method - CHI The χ² term-category independence test compares observed co-occurrence counts O(i, j) with the counts E(i, j) expected under independence:

E(i, j) = ( Σ_{a∈{w, w̄}} O(a, j) ) · ( Σ_{b∈{c, c̄}} O(i, b) ) / N

χ²(w, c) = Σ_{i∈{w, w̄}} Σ_{j∈{c, c̄}} (O(i, j) − E(i, j))² / E(i, j)

A 2-way contingency table:

    | c   | c̄   | Σ
w   | 40  | 80  | 120
w̄   | 60  | 320 | 380
Σ   | 100 | 400 | 500

For this table, χ²(w, c) ≈ 17.5. By ranking the χ² statistic, the CHI method selects features that have a strong dependency on the categories. For an m-ary classifier: χ²_max(w) = max_{k=1..m} { χ²(w, c_k) }.
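
The χ² value for the contingency table above can be checked directly with a small sketch (the observed counts are the ones from the table on this slide):

```python
def chi_square(o_wc, o_wnc, o_nwc, o_nwnc):
    """Chi-square statistic for a 2x2 word/category table of observed counts."""
    n = o_wc + o_wnc + o_nwc + o_nwnc
    row = [o_wc + o_wnc, o_nwc + o_nwnc]           # word present / absent totals
    col = [o_wc + o_nwc, o_wnc + o_nwnc]           # category present / absent totals
    observed = [[o_wc, o_wnc], [o_nwc, o_nwnc]]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n          # E(i, j) under independence
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

print(chi_square(40, 80, 60, 320))                  # ~17.5 for the table above
```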

Part Four: Text Mining Algorithms
- Text Classification: Naïve Bayes (benchmark), Support Vector Machine
- Text Clustering: K-means

Text Classification Supervised learning. Supervision: the training data (observations, measurements, etc.) for model construction are accompanied by labels indicating the class of the observations. The model is represented as classification rules, decision trees, or mathematical formulae. New data is classified based on the model.

Bayesian Theorem: Basics Let X be a data sample ("evidence") with unknown class label, and let H be the hypothesis that X belongs to class C. Classification is to determine P(H|X), the posterior probability that the hypothesis holds given the observed data sample X. P(H) is the prior probability, the initial probability independent of X; e.g., the probability that a customer will buy a computer, regardless of age, income, etc. P(X) is the probability that the sample data is observed. P(X|H) is the likelihood, the probability of observing the sample X given that the hypothesis holds; e.g., given that X will buy a computer, the probability that X's age is 31..40 with medium income.

Bayesian Theorem Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

P(H|X) = P(X|H) P(H) / P(X)

Informally, this can be written as posterior = likelihood × prior / evidence. Predict that X belongs to C_i iff the probability P(C_i|X) is the highest among all the P(C_k|X) for the k classes.

Towards Naïve Bayesian Classifier Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x_1, x_2, ..., x_n). Suppose there are m classes C_1, C_2, ..., C_m. Classification derives the maximum posterior, i.e., the maximal P(C_i|X). This can be derived from Bayes' theorem:

P(C_i|X) = P(X|C_i) P(C_i) / P(X)

Since P(X) is constant for all classes, only P(X|C_i) P(C_i) needs to be maximized.

Derivation of Naïve Bayes Classifier Naïve: a simplifying assumption that attributes are conditionally independent given the class (i.e., no dependence relation between attributes):

P(X|C_i) = Π_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) × P(x_2|C_i) × ... × P(x_n|C_i)

This greatly reduces the computation cost: only the class distribution needs to be counted. If attribute A_k is categorical, P(x_k|C_i) is the number of tuples in C_i having value x_k for A_k, divided by |C_{i,D}| (the number of tuples of C_i in D). If A_k is continuous-valued, P(x_k|C_i) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:

g(x, μ, σ) = (1 / (√(2π) σ)) e^{ −(x−μ)² / (2σ²) }

and P(x_k|C_i) = g(x_k, μ_{C_i}, σ_{C_i}).
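
For text, the attributes are usually word counts, so in practice a multinomial variant of Naïve Bayes with Laplace (add-one) smoothing is common. The sketch below, on a tiny made-up training set, shows the counting involved; it is an illustration of that variant, not of the Gaussian case above:

```python
import math
from collections import Counter, defaultdict

# Tiny made-up training set: (tokenized document, class label).
train = [(["boys", "play", "basketball"], "sports"),
         (["play", "football"], "sports"),
         (["stock", "market", "falls"], "finance")]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)                  # word_counts[c][w] = count of w in class c
for tokens, label in train:
    word_counts[label].update(tokens)
vocab = {w for tokens, _ in train for w in tokens}

def predict(tokens):
    best_class, best_score = None, float("-inf")
    for c in class_counts:
        # log P(C_i) + sum_k log P(x_k | C_i), with add-one smoothing
        score = math.log(class_counts[c] / len(train))
        total = sum(word_counts[c].values())
        for w in tokens:
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict(["boys", "play", "football"]))        # -> 'sports'
```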

SVM Support Vector Machines A classification method for both linear and nonlinear data. It uses a nonlinear mapping to transform the original training data into a higher dimension. In the new dimension, it searches for the linear optimal separating hyperplane (i.e., a decision boundary). With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. SVM finds this hyperplane using support vectors (essential training tuples) and margins (defined by the support vectors).

SVM General Philosophy [Figure: two candidate separating hyperplanes, one with a small margin and one with a large margin; the support vectors are the training tuples lying closest to the decision boundary.]

SVM When Data Is Linearly Separable Let the data D be (X_1, y_1), ..., (X_|D|, y_|D|), where X_i is a training tuple with associated class label y_i. There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data). SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH).
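
A minimal sketch of training a linear SVM on TF-IDF text features with scikit-learn; scikit-learn is assumed to be installed, and the miniature corpus and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Made-up miniature training corpus with class labels.
docs = ["boys play basketball", "girls play football",
        "stock market falls", "shares rise on earnings"]
labels = ["sports", "sports", "finance", "finance"]

# TF-IDF representation followed by a maximum-margin linear classifier.
model = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
model.fit(docs, labels)

print(model.predict(["young boys play football"]))   # -> ['sports']
```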

Text Clustering Text clustering is the unsupervised, automatic grouping of text documents into conceptually meaningful clusters, so that documents within a cluster are highly similar to each other but dissimilar to documents in other clusters. It differs from text classification in the lack of labeled documents for training.

Text Clustering A typical text clustering activity involves three steps, with a feedback loop: documents → feature selection/extraction and document representation → document similarity measurement → grouping into clusters. Cluster validity analysis: internal quality measure: overall similarity; external quality measures: F-measure, entropy, purity.

The K-Means Clustering Method The centroid of a cluster of numerical values is the mean of all objects in the cluster:

C_m = ( Σ_{i=1}^{N} t_{ip} ) / N

Given k, the k-means algorithm is implemented in four steps:
1. Select k seed points from D as the initial centroids.
2. Assigning: assign each object of D to the cluster with the nearest centroid.
3. Updating: compute the centroids of the clusters of the current partition.
4. Go back to Step 2 and continue; stop when there are no more new assignments.
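
A compact sketch of the four steps with NumPy; Euclidean distance is used for clarity, although text clustering usually works with cosine similarity on TF-IDF vectors, and the sample points are made up:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # step 1: pick k seed points
    for _ in range(max_iter):
        # step 2: assign each object to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                # step 4: stop when nothing changes
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])  # made-up 2-D points
print(k_means(X, k=2))
```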

The K-Means Clustering Method Example [Figure: K = 2 example — arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, and reassign; repeat until the assignments no longer change.]

Cosine Similarity Terms in a document are represented as a vector. Cosine measure: if d_1 and d_2 are two vectors, then

cos(d_1, d_2) = (d_1 · d_2) / (||d_1|| ||d_2||)

where · indicates the vector dot product and ||d|| is the length of vector d. Example:
d_1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d_2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
d_1 · d_2 = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
||d_1|| = (3·3 + 2·2 + 0·0 + 5·5 + 0·0 + 0·0 + 0·0 + 2·2 + 0·0 + 0·0)^0.5 = (42)^0.5 ≈ 6.481
||d_2|| = (1·1 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 1·1 + 0·0 + 2·2)^0.5 = (6)^0.5 ≈ 2.449
cos(d_1, d_2) ≈ 0.3150
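
A quick sketch that reproduces the numbers in the example above:

```python
import math

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))         # d1 . d2 = 5
    norm_a = math.sqrt(sum(x * x for x in a))      # ||d1|| = sqrt(42) ~ 6.481
    norm_b = math.sqrt(sum(y * y for y in b))      # ||d2|| = sqrt(6)  ~ 2.449
    return dot / (norm_a * norm_b)

print(round(cosine(d1, d2), 4))                    # 0.315
```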

Cosine Function May Not Work A document may contain only a small subset of terms from the large vocabulary of its topic (for example, a topic versus its subtopics, or synonym usage). The vocabulary sizes of different topics also differ, so a cluster with a large vocabulary will be forced to split in order to increase the value of the criterion function.

cos(d_i, d_j) = (d_i^t · d_j) / (||d_i|| ||d_j||)

Neighbor Matrix The neighbors of a document d are those documents considered similar to it: d_i and d_j are neighbors if sim(d_i, d_j) ≥ θ, where θ ∈ [0, 1] is a similarity threshold. The neighbor matrix M is an n × n adjacency matrix. The link between two documents d_i and d_j is the number of their common neighbors:

link(d_i, d_j) = Σ_{m=1}^{n} M[i, m] · M[m, j]

     d_1 d_2 d_3 d_4 d_5 d_6
d_1   1   1   0   1   0   0
d_2   1   1   0   1   0   0
d_3   0   0   1   1   1   0
d_4   1   1   1   1   0   0
d_5   0   0   1   0   1   0
d_6   0   0   0   0   0   1
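
Since link(d_i, d_j) counts common neighbors, the whole link matrix is just the product M·M of the adjacency matrix with itself; a small sketch using the matrix from this slide:

```python
import numpy as np

# Neighbor (adjacency) matrix M from the slide: M[i, j] = 1 if d_i and d_j are neighbors.
M = np.array([[1, 1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 1, 1, 0],
              [1, 1, 1, 1, 0, 0],
              [0, 0, 1, 0, 1, 0],
              [0, 0, 0, 0, 0, 1]])

links = M @ M            # links[i, j] = sum_m M[i, m] * M[m, j] = number of common neighbors
print(links[0, 2])       # common neighbors of d_1 and d_3 (here 1, namely d_4)
```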

Similarity Measurement Function CL (Cosine and Link) Global information given by the neighbor matrix can be combined with the pair-wise similarity measurement (the cosine function) to address those problems:

link(d_i, c_j) = Σ_{m=1}^{n} M'[i, m] · M[m, j]

f(d_i, c_j) = a · link(d_i, c_j) / N'_max + (1 − a) · cos(d_i, c_j),  a ∈ [0, 1]

where N'_max normalizes the link counts. [Table: neighbor matrix with rows d_1, ..., d_n and columns c_1, ..., c_k.] Experimental results show that the best range for the coefficient a is 0.8 ~ 0.95.
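
Putting the pieces together, a rough sketch of the combined measure as reconstructed above; the normalization by the maximum link count, the coefficient a, and all numeric values below are illustrative assumptions rather than the exact published formulation:

```python
import numpy as np

def cl_similarity(cos_sim, link_counts, a=0.9):
    """f(d_i, c_j) = a * link(d_i, c_j) / max_link + (1 - a) * cos(d_i, c_j)."""
    max_link = max(int(link_counts.max()), 1)      # avoid division by zero when no links exist
    return a * link_counts / max_link + (1 - a) * cos_sim

# Hypothetical values for one document against three candidate clusters.
cos_sim = np.array([0.20, 0.35, 0.10])             # pair-wise cosine similarities
link_counts = np.array([4, 1, 0])                  # common-neighbor counts with each cluster
print(cl_similarity(cos_sim, link_counts))
```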

Neighbors and Links Concepts [Figure: points A, B, C, and D distributed across clusters C_1, C_2, and C_3, illustrating the neighbor and link concepts.]

Part Five: Software Tools for Text Mining Lemur (in C++), https://www.lemurproject.org/, from the University of Massachusetts Amherst and Carnegie Mellon University. Weka (in Java), http://www.cs.waikato.ac.nz/ml/weka/, from the University of Waikato.

Questions?