Classification. Team Ravana. Team Members: Cliffton Fernandes, Nikhil Keswaney


Email Classification. Team Ravana. Team Members: Cliffton Fernandes, Nikhil Keswaney

Hello! Cliffton Fernandes (MS-CS), Nikhil Keswaney (MS-CS) 2

Topic Area Email Classification. Spam: unsolicited junk mail. Ham: desired email, i.e. not spam. Accuracy improves as more data is processed. The Enron Corpus (500k emails) requires a large amount of computational resources and time. Parallel computing can be used to reduce that time. 3

Approach K-Nearest Neighbours (KNN) algorithm: a classification algorithm. Example of KNN. 4
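To make the approach concrete, here is a minimal KNN classification sketch in Python. It is only an illustration of the general algorithm, not our project code; the toy vectors, the Euclidean distance, and k = 3 are placeholder assumptions.

```python
# Minimal K-Nearest Neighbours sketch (illustrative only; data and k are made up).
from collections import Counter
import math

def euclidean(a, b):
    # Straight-line distance between two equal-length feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_vectors, train_labels, query, k=3):
    # Rank training vectors by distance to the query and vote among the k closest.
    distances = sorted(
        (euclidean(vec, query), label) for vec, label in zip(train_vectors, train_labels)
    )
    top_k_labels = [label for _, label in distances[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]

# Toy example: binary term-presence vectors labelled spam/ham.
train_vectors = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
train_labels = ["spam", "spam", "ham", "ham"]
print(knn_predict(train_vectors, train_labels, [1, 1, 0], k=3))  # -> "spam"
```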

Literature Search Paper 1 Title: A Novel Method for Detecting Spam Email using KNN Classification with Spearman Correlation as Distance Measure. Authors: Ajay Sharma, Anil Suryawanshi. International Journal of Computer Applications. Year of Publication: 2016. Link: https://www.ijcaonline.org/archives/volume136/number6/24159-2016908471 5

Problem Addressed K-NN uses a similarity measure between objects for classification. Choosing the right similarity measure affects both the accuracy and the speed of classification. Comparative study of Euclidean distance vs. Spearman's coefficient. 6

Novel Contributions 7

Similarity Measures Euclidean Distance, Spearman's Coefficient. X_1 = (x_11, x_12, ..., x_1n), X_2 = (x_21, x_22, ..., x_2n), n = number of features per object. 8
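As a rough illustration of the two measures (not the paper's implementation), the sketch below computes Euclidean distance by hand and uses scipy.stats.spearmanr for Spearman's coefficient; the toy vectors x1 and x2 are made up.

```python
# Sketch of the two similarity measures compared in the paper (illustrative only).
import math
from scipy.stats import spearmanr  # handles tied ranks internally

def euclidean(a, b):
    # Straight-line distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

x1 = [3, 0, 1, 4, 0]   # toy term-count vectors (placeholder data)
x2 = [2, 0, 1, 5, 1]

rho, _pvalue = spearmanr(x1, x2)   # Spearman's rank correlation, in [-1, 1]
print("euclidean:", euclidean(x1, x2))
print("spearman :", rho)
```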

Evaluation Metrics F-Measure, Precision, Recall, Accuracy. TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative. 9
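For reference, these metrics follow directly from the four counts; a minimal sketch, with made-up counts in the example call:

```python
# Evaluation metrics from the confusion-matrix counts (standard definitions).
def metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f_measure, accuracy

# Example with made-up counts: 80 true positives, 90 true negatives, etc.
print(metrics(tp=80, tn=90, fp=10, fn=20))
```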


Applications to our Investigation The paper concludes that Spearman's coefficient is the better-performing similarity measure for the K-NN algorithm. Therefore, we will use Spearman's coefficient in our project. 14

Literature Search Paper 2 Title: KNN with TF-IDF Based Framework for Text Categorization. Authors: Bruno Trstenjak, Sasa Mikac, Dzenana Donko. 24th DAAAM International Symposium on Intelligent Manufacturing and Automation, 2013. Year of Publication: 2013. Link: https://doi.org/10.1016/j.proeng.2014.03.129 15

Problems Addressed Implementing KNN (K-Nearest Neighbour) using TF-IDF (Term Frequency-Inverse Document Frequency) for weight-matrix calculation. Benchmarking the implemented technique. 16

Novel Contributions TF-IDF (Term Frequency-Inverse Document Frequency) What is TF-IDF? TF-IDF is a numerical statistic that determines a weight for each term in each document. It is often used in NLP, information retrieval, and text mining. The method evaluates the importance of terms (words) in a document collection: a term's importance increases proportionally with the number of times it appears in a document, offset by how common the term is across the whole corpus. 17

Novel Contributions TF-IDF (Term Frequency-Inverse Document Frequency) Term Frequency computes the frequency of a particular word i in a particular document j. f_ij = frequency of term i in document j; tf_ij = term frequency of i in document j. tf_ij = f_ij / max_i {f_ij} 18

Novel Contributions TF-IDF (Term Frequency-Inverse Document Frequency) Inverse Document Frequency weighs a word by the inverse proportion of documents in the corpus that contain it. idf_i = inverse document frequency of term i; df_i = document frequency of term i; N = number of documents. idf_i = log_2(N / df_i) 19

Novel Contributions TF-IDF (Term Frequency-Inverse Document Frequency) f_ij = frequency of term i in document j; tf_ij = term frequency of i in document j; idf_i = inverse document frequency of term i; df_i = document frequency of term i; N = number of documents. (TF-IDF)_ij = (f_ij / max_i {f_ij}) × log_2(N / df_i) 20
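A minimal sketch of this weighting in plain Python (illustrative only; the toy documents and tokenization are assumptions, and the paper's framework may normalize differently):

```python
# Sketch of the TF-IDF weighting described above (pure Python, illustrative only).
import math
from collections import Counter

def tfidf_matrix(documents):
    # documents: list of token lists; returns one {term: weight} dict per document.
    n = len(documents)
    # df_i: number of documents containing term i
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        freqs = Counter(doc)                    # f_ij
        max_f = max(freqs.values())             # max_i f_ij
        weights.append({
            term: (f / max_f) * math.log2(n / df[term])
            for term, f in freqs.items()
        })
    return weights

docs = [["cheap", "purchase", "now"], ["contract", "submitted"], ["cheap", "cheap", "deal"]]
for w in tfidf_matrix(docs):
    print(w)
```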

Novel Contributions TF-IDF (Term Frequency-Inverse Document Frequency) Benchmarks 21

Novel Contributions TF-IDF (Term Frequency-Inverse Document Frequency) Benchmarks The last test measured classification accuracy by document category. The authors suspected that the differences between categories were caused by document preprocessing, i.e. the removal of undesired characters and words. 22


Applications to our Investigation Term Binary (as explained in our previous presentation). Spam = { cheap: 40, purchase: 33, ... } Ham = { contract: 20, submitted: 11, ... }

Email  | Cheap | Purchase | Contract | ... | Submitted | Answer
1      | 1     | 1        | 0        | ... | 0         | Spam
2      | 0     | 0        | 0        | ... | 1         | Ham
3      | 0     | 0        | 1        | ... | 0         | Ham
4      | 1     | 0        | 0        | ... | 0         | Spam
...    | ...   | ...      | ...      | ... | ...       | ...
500000 | 0     | 1        | 0        | ... | 0         | Spam
24
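A small sketch of how such binary term vectors could be built (illustrative; the vocabulary and email text below are made up):

```python
# Sketch of the term-binary representation shown in the table above (illustrative).
def binary_vector(email_tokens, vocabulary):
    # 1 if the vocabulary term occurs in the email, 0 otherwise.
    present = set(email_tokens)
    return [1 if term in present else 0 for term in vocabulary]

vocabulary = ["cheap", "purchase", "contract", "submitted"]   # placeholder vocabulary
email = "purchase cheap meds now".split()
print(binary_vector(email, vocabulary))   # -> [1, 1, 0, 0]
```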

Literature Search Paper 3 Title: Improved KNN Text Classification Algorithm with MapReduce Implementation. Authors: Yan Zhao, Yun Qian, Cuixia Li. The 2017 4th International Conference on Systems and Informatics (ICSAI 2017). Year of Publication: 2017. Link: https://ieeexplore.ieee.org/document/8248509 25

Problems Addressed Implementing an improved TF-IDF (Term Frequency-Inverse Document Frequency) for weight-matrix calculation. Implementing the KNN algorithm with the improved TF-IDF using MapReduce. Benchmarking the implemented technique. 26

Novel Contributions Improvements To reduce the matrix size, they selected only the P words with the highest TF-IDF scores and removed duplicate words. For M documents, the resulting matrix is of size M × P. The selected highest-scoring words form the feature vectors. 27
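A sketch of this selection step (illustrative; the slide does not say how duplicate words across documents are merged, so keeping each word's best score is an assumption here):

```python
# Sketch of selecting the P highest-scoring words as features (illustrative only).
def select_feature_words(tfidf_weights, p):
    # tfidf_weights: list of {term: score} dicts, one per document.
    best = {}
    for doc_weights in tfidf_weights:
        for term, score in doc_weights.items():
            # Drop duplicate words by keeping the best score seen for each term (assumption).
            best[term] = max(score, best.get(term, 0.0))
    return sorted(best, key=best.get, reverse=True)[:p]

scores = [{"cheap": 0.9, "now": 0.2}, {"contract": 0.8, "cheap": 0.4}]
print(select_feature_words(scores, p=2))   # -> ['cheap', 'contract']
```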

Novel Contributions Traditional TF-IDF, with i the word and j the document: (TF-IDF)_ij = (f_ij / max_i {f_ij}) × log_2(N / df_i). Improvements The IDF factor lowers the weight of a word that occurs frequently across all documents. But what if such a word occurs in high concentration in one category? To handle this, they include a factor DF_i, the discrimination factor of word i, which raises the weight of words that should be weighted more. T = total number of words in this category; m_i = number of texts in the category. DF_i = log_2(T / m_i), (TF-IDF-DF)_ij = (TF-IDF)_ij × DF_i 28
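A small sketch of the discrimination factor as read from the slide (the exact counting of T and m_i in the paper may differ; the numbers in the example are made up):

```python
# Sketch of the discrimination factor DF_i and the combined TF-IDF-DF weight
# (my reading of the slide; the exact counting used in the paper may differ).
import math

def discrimination_factor(total_words_in_category, m_i):
    # DF_i = log2(T / m_i)
    return math.log2(total_words_in_category / m_i)

def tfidf_df(tfidf_ij, df_i):
    # (TF-IDF-DF)_ij = (TF-IDF)_ij * DF_i
    return tfidf_ij * df_i

# Made-up example numbers.
df_cheap = discrimination_factor(total_words_in_category=4000, m_i=50)
print(tfidf_df(0.85, df_cheap))
```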

Novel Contributions Improvements Some words are distributed almost evenly across the dataset. For example, if "China" appears 9 times in spam and 10 times in ham, it should be eliminated when building feature vectors, but because of TF-IDF it can still get a high value and be included. So they use the standard deviation formula below to remove all evenly distributed words from the feature vector. G = total number of categories; x_j = number of times the word occurred in category j; μ = total number of times the word occurred over all categories. σ = √( (1/G) · Σ_{j=1}^{G} (x_j − μ/G)² ) / μ 29
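A sketch of this evenness filter using the reconstructed formula above (my reading of the slide; the threshold below which a word is dropped is not shown and is left to the caller):

```python
# Sketch of the evenness filter: sigma = sqrt((1/G) * sum_j (x_j - mu/G)^2) / mu.
# Words whose sigma falls below a chosen threshold would be removed from the feature vector.
import math

def evenness_sigma(counts_per_category):
    g = len(counts_per_category)            # G: number of categories
    mu = sum(counts_per_category)           # total occurrences over all categories
    mean = mu / g                           # mu / G
    variance = sum((x - mean) ** 2 for x in counts_per_category) / g
    return math.sqrt(variance) / mu if mu else 0.0

# "China": 9 times in spam, 10 in ham -> nearly even, small sigma -> filtered out.
print(evenness_sigma([9, 10]))
# "cheap": 40 times in spam, 2 in ham -> uneven, larger sigma -> kept.
print(evenness_sigma([40, 2]))
```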

Novel Contributions Implementation in MapReduce KNN is a widely used text classification algorithm, but when faced with enormous data a sequential program is inadequate, so they used MapReduce. The basic idea of the MapReduce version is as follows: 1. Calculate the TF-IDF-DF score matrix according to the formulae above. 2. Obtain the feature vectors. 3. Vectorize the training set. MAP: calculate the distance of the query vector to all training-data vectors. REDUCE: merge the MAP outputs, find the K nearest neighbours, and assign the category in which the query has the most neighbours. 30
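A toy map/reduce-style sketch of this flow in Python (not the paper's Hadoop implementation; the use of math.dist, the partitioning, and k = 3 are placeholder choices):

```python
# Toy map/reduce-style KNN sketch (illustrative only; distance and k are placeholders).
from collections import Counter
import math

def map_phase(query, training_partition):
    # MAP: emit (distance, label) for every training vector in this partition.
    return [(math.dist(query, vec), label) for vec, label in training_partition]

def reduce_phase(mapped_partitions, k=3):
    # REDUCE: merge all partial results, keep the k nearest, vote on the category.
    merged = sorted(pair for partition in mapped_partitions for pair in partition)
    top_k = [label for _, label in merged[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Two made-up training partitions of (vector, label) pairs.
partitions = [
    [([1, 1, 0], "spam"), ([0, 0, 1], "ham")],
    [([1, 0, 0], "spam"), ([0, 1, 1], "ham")],
]
mapped = [map_phase([1, 1, 0], part) for part in partitions]
print(reduce_phase(mapped, k=3))   # -> "spam"
```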

Novel Contributions Benchmarks 31

Applications to our Investigation Since we have only 2 categories (Spam and Ham), the standard deviation formula becomes σ = √( (1/2) · Σ_{j=1}^{2} (x_j − μ/2)² ) / μ. The weight matrix that we are calculating will include these modifications. The implementation of the parallel algorithm will use the MapReduce approach, with one modification: the distance calculation between emails will be done using Spearman's coefficient. 32

Thanks! Any questions? 33