Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York

Similar documents
D B M G Data Base and Data Mining Group of Politecnico di Torino

Association Rules. Fundamentals

D B M G. Association Rules. Fundamentals. Elena Baralis, Silvia Chiusano. Politecnico di Torino 1. Definitions.

D B M G. Association Rules. Fundamentals. Association rules. Association rule mining. Definitions. Rule quality metrics: example

The Market-Basket Model. Association Rules. Example. Support. Applications --- (1) Applications --- (2)

Data Mining and Analysis: Fundamental Concepts and Algorithms

Association Rules Information Retrieval and Data Mining. Prof. Matteo Matteucci

CS-C Data Science Chapter 8: Discrete methods for analyzing large binary datasets

DATA MINING LECTURE 3. Frequent Itemsets Association Rules

Lecture Notes for Chapter 6. Introduction to Data Mining

CS5112: Algorithms and Data Structures for Applications

Mining Positive and Negative Fuzzy Association Rules

Association Rule Mining on Web

Association Rule. Lecturer: Dr. Bo Yuan. LOGO

Data mining, 4 cu Lecture 5:

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

Data Mining: Concepts and Techniques. (3rd ed.) Chapter 6

10/19/2017 MIST.6060 Business Intelligence and Data Mining 1. Association Rules

Algorithmic Methods of Data Mining, Fall 2005, Course overview 1. Course overview

15 Introduction to Data Mining

732A61/TDDD41 Data Mining - Clustering and Association Analysis

NetBox: A Probabilistic Method for Analyzing Market Basket Data

Machine Learning: Pattern Mining

COMP 5331: Knowledge Discovery and Data Mining

Frequent Itemset Mining

DATA MINING - 1DL360

Association Rule Mining

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit:

Data Mining Concepts & Techniques

Encyclopedia of Machine Learning (2010), chapter: Frequent Pattern. Hannu Toivonen

DATA MINING - 1DL360

CS 584 Data Mining. Association Rule Mining 2

Pattern Structures 1

Mining Infrequent Patterns

Data Analytics Beyond OLAP. Prof. Yanlei Diao

Approximate counting: count-min data structure. Problem definition

Outline. Fast Algorithms for Mining Association Rules. Applications of Data Mining. Data Mining. Association Rule. Discussion

Frequent Itemset Mining

Handling a Concept Hierarchy

CS 484 Data Mining. Association Rule Mining 2

Introduction to Data Mining

1 [15 points] Frequent Itemsets Generation With Map-Reduce

Meelis Kull - Autumn - MTAT Data Mining - Lecture 05

CS246 Final Exam. March 16, :30AM - 11:30AM

Chapter 6. Frequent Pattern Mining: Concepts and Apriori. Meng Jiang CSE 40647/60647 Data Science Fall 2017 Introduction to Data Mining

CSE 5243 INTRO. TO DATA MINING

Assignment 7 (Sol.) Introduction to Data Analytics Prof. Nandan Sudarsanam & Prof. B. Ravindran

CSE-4412(M) Midterm. There are five major questions, each worth 10 points, for a total of 50 points. Points for each sub-question are as indicated.

CPDA Based Fuzzy Association Rules for Learning Achievement Mining

.. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..

Data Mining and Knowledge Discovery. Petra Kralj Novak. 2011/11/29

Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Karpatne, Kumar

12 Count-Min Sketch and Apriori Algorithm (and Bloom Filters)

Retrieval by Content. Part 2: Text Retrieval Term Frequency and Inverse Document Frequency. Srihari: CSE 626 1

Data Warehousing & Data Mining

DATA MINING LECTURE 4. Frequent Itemsets and Association Rules

Association Analysis. Part 1

Frequent Itemset Mining

Chapters 6 & 7, Frequent Pattern Mining

Alternative Approach to Mining Association Rules

Interesting Patterns. Jilles Vreeken. 15 May 2015

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Model-based probabilistic frequent itemset mining

Association Rules. Jones & Bartlett Learning, LLC

1. Data summary and visualization

On Differentially Private Frequent Itemsets Mining

DATA MINING LECTURE 4. Frequent Itemsets, Association Rules Evaluation Alternative Algorithms

Association Analysis: Basic Concepts. and Algorithms. Lecture Notes for Chapter 6. Introduction to Data Mining

Association Rules. Acknowledgements. Some parts of these slides are modified from. n C. Clifton & W. Aref, Purdue University

CSE 5243 INTRO. TO DATA MINING

Mining Molecular Fragments: Finding Relevant Substructures of Molecules

Chapter 4: Frequent Itemsets and Association Rules

Sequential Pattern Mining

Data Warehousing & Data Mining

Data Warehousing. Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig

Exercises, II part

Bayesian Network Structure Learning and Inference Methods for Handwriting

Mining Rank Data. Sascha Henzgen and Eyke Hüllermeier. Department of Computer Science University of Paderborn, Germany

Machine Learning Overview

Density-Based Clustering

Summary. 8. Business Intelligence. 8.1 BI Overview

Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional Time-Series Data

Association Analysis. Part 2

Count-Min Tree Sketch: Approximate counting for NLP

Lecture Notes for Chapter 6. Introduction to Data Mining. (modified by Predrag Radivojac, 2017)

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting

Frequent Pattern Mining: Exercises

Lecture 5: Web Searching using the SVD

CPSC340. Probability. Nando de Freitas September, 2012 University of British Columbia

Mining Data Streams. The Stream Model. The Stream Model Sliding Windows Counting 1 s

13 Searching the Web with the SVD

9 Searching the Internet with the SVD

Mining Class-Dependent Rules Using the Concept of Generalization/Specialization Hierarchies

Model Complexity of Pseudo-independent Models

Statistical Privacy For Privacy Preserving Information Sharing

ECLT 5810 Data Preprocessing. Prof. Wai Lam

Data Analysis and Uncertainty Part 1: Random Variables

STA 584 Supplementary Examples (not to be graded) Fall, 2003

Apriori algorithm. Seminar of Popular Algorithms in Data Mining and Machine Learning, TKK. Presentation Lauri Lahti

CHAPTER-17. Decision Tree Induction

Transcription:

Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari, University at Buffalo, The State University of New York.

A Priori Algorithm for Association Rule Learning
An association rule is a representation for local patterns in data mining. What is an association rule? It is a probabilistic statement about the co-occurrence of certain events in the database, particularly applicable to sparse transaction data sets.

Examples of Patterns and Rules
Supermarket: 10 percent of customers buy wine and cheese.
Telecommunications: if alarms A and B occur within 30 seconds of each other, then alarm C occurs within 60 seconds with probability 0.5.
Weblog: if a person visits the CNN website, there is a 60% chance the person will visit the ABC News website in the same month.

Form of Association Rule
Assume all variables are binary. An association rule has the form: if A=1 and B=1 then C=1 with probability p, where A, B, C are binary variables and p = p(C=1 | A=1, B=1). The conditional probability p is the accuracy or confidence of the rule; p(A=1, B=1, C=1) is the support.

Accuracy vs Support
For the rule "if A=1 and B=1 then C=1 with probability p", p = p(C=1 | A=1, B=1) and p(A=1, B=1, C=1) is the support. Accuracy is a conditional probability: given that A and B are present, what is the probability that C is present? Support is a joint probability: what is the probability that A, B, and C are all present? (Example of three students in class.)

Goal of Association Rule Learning
For the rule "if A=1 and B=1 then C=1 with probability p", p = p(C=1 | A=1, B=1) and p(A=1, B=1, C=1) is the support. The goal is to find all rules that satisfy the constraints that the accuracy p is greater than a threshold p_a and the support is greater than a threshold p_s. Example: find all rules with accuracy greater than 0.8 and support greater than 0.05.
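As a concrete sketch, both quantities can be estimated by counting rows of a binary data matrix. The snippet below is a minimal illustration in Python; the transactions and variable names are invented for this example, not taken from the slides.

```python
# Estimate accuracy and support of the rule "if A=1 and B=1 then C=1"
# from a list of binary transactions (one dict per basket).
transactions = [
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 0},
    {"A": 0, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 1},
]

n = len(transactions)
n_ab = sum(t["A"] == 1 and t["B"] == 1 for t in transactions)
n_abc = sum(t["A"] == 1 and t["B"] == 1 and t["C"] == 1 for t in transactions)

support = n_abc / n       # estimates p(A=1, B=1, C=1)
accuracy = n_abc / n_ab   # estimates p(C=1 | A=1, B=1)

p_a, p_s = 0.8, 0.05      # thresholds from the example above
print(support, accuracy, accuracy > p_a and support > p_s)
```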

Association Rules are Patterns in Data
For the rule "if A=1 and B=1 then C=1 with probability p", p = p(C=1 | A=1, B=1) and p(A=1, B=1, C=1) is the support. Association rules are a weak form of knowledge: they are summaries of co-occurrence patterns in the data, rather than strong statements that characterize the population as a whole. The if-then here is inherently correlational, not causal.

Origin of Association Rule Mining
Association rule mining arose in applications involving market-basket data: data recorded in a database where each observation consists of an actual basket of items (such as grocery items). Association rules were invented to find simple patterns in such data in a computationally efficient manner.

Basket Data

Basket  A1  A2  A3  A4  A5
t1       1   0   0   0   0
t2       1   1   1   1   0
t3       1   0   1   0   1
t4       0   0   1   0   0
t5       0   1   1   1   0
t6       1   1   1   0   0
t7       1   0   1   1   0

For 5 items there are 2^5 = 32 different possible baskets. A set of baskets typically has a great deal of structure.

Data Matrix
The data matrix has N rows (corresponding to baskets) and K columns (corresponding to items), with N in the millions and K in the tens of thousands. It is very sparse, since a typical basket contains only a few items.

General Form of Association Rule
Given a set of 0/1-valued variables A_1, ..., A_K, a rule has the form

(A_{i_1} = 1) ∧ ... ∧ (A_{i_k} = 1) ⇒ (A_{i_{k+1}} = 1)

where 1 ≤ i_j ≤ K for all j = 1, ..., k+1. The subscripts allow any combination of variables in the rule. The rule can be written more briefly as A_{i_1} ∧ ... ∧ A_{i_k} ⇒ A_{i_{k+1}}. A pattern such as (A_{i_1} = 1) ∧ ... ∧ (A_{i_k} = 1) is known as an itemset.

Frequency of Itemsets
A rule is an expression of the form θ ⇒ φ, where θ is an itemset pattern and φ is an itemset pattern consisting of a single conjunct. Given an itemset pattern θ, its frequency fr(θ) is the number of cases in the data that satisfy θ; the frequency fr(θ ∧ φ) is the support. The accuracy of the rule is

c(θ ⇒ φ) = fr(θ ∧ φ) / fr(θ),

the conditional probability that φ is true given that θ is true. Frequent sets: given a frequency threshold s, the frequent sets are all itemset patterns whose frequency is at least s.

Example of Frequent Itemsets

Basket  A1  A2  A3  A4  A5
t1       1   0   0   0   0
t2       1   1   1   1   0
t3       1   0   1   0   1
t4       0   0   1   0   0
t5       0   1   1   1   0
t6       1   1   1   0   0
t7       1   0   1   1   0
t8       0   1   1   0   0
t9       1   0   0   1   0
t10      0   1   1   0   1

Frequent sets for threshold 0.4 are: {A1}, {A2}, {A3}, {A4}, {A1, A3}, {A2, A3}. Rule A1 ⇒ A3 has accuracy 4/6 = 2/3. Rule A2 ⇒ A3 has accuracy 5/5 = 1.
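To check the example, here is a brute-force enumeration over the ten baskets above; it is a sketch for illustration only (the A Priori algorithm described later avoids enumerating all itemsets).

```python
from itertools import combinations

# Rows t1..t10 of the basket data; columns are A1..A5.
rows = [
    (1, 0, 0, 0, 0), (1, 1, 1, 1, 0), (1, 0, 1, 0, 1), (0, 0, 1, 0, 0),
    (0, 1, 1, 1, 0), (1, 1, 1, 0, 0), (1, 0, 1, 1, 0), (0, 1, 1, 0, 0),
    (1, 0, 0, 1, 0), (0, 1, 1, 0, 1),
]

def freq(itemset):
    """Number of rows in which every item (column index) equals 1."""
    return sum(all(r[i] == 1 for i in itemset) for r in rows)

min_count = 0.4 * len(rows)   # frequency threshold of 0.4
for k in (1, 2):
    for items in combinations(range(5), k):
        if freq(items) >= min_count:
            print({f"A{i+1}" for i in items}, freq(items))

print(freq((0, 2)) / freq((0,)))   # accuracy of A1 => A3: 4/6
print(freq((1, 2)) / freq((1,)))   # accuracy of A2 => A3: 5/5
```

Running this prints exactly the six frequent sets and the two rule accuracies stated above.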

Association Rule Algorithm Tuple
1. Task = description: associations between variables
2. Structure = probabilistic association rules (patterns)
3. Score function = threshold on accuracy and support
4. Search method = systematic search (breadth-first with pruning)
5. Data management technique = multiple linear scans

Score Function
For the rule "if A=1 and B=1 then C=1 with probability p", p = p(C=1 | A=1, B=1) and p(A=1, B=1, C=1) is the support.
1. The score function is a binary function (defined in 2), using two thresholds: p_s is a lower bound on the support of the rule (e.g., p_s = 0.1: we want only rules that cover at least 10% of the data), and p_a is a lower bound on the accuracy of the rule (e.g., p_a = 0.9: we want only rules that are 90% accurate).
2. A pattern gets a score of 1 if it satisfies both threshold conditions and a score of 0 otherwise.
3. The goal is to find all rules (patterns) with a score of 1.

Search Problem
Searching for all rules is a formidable problem: there is an exponential number of association rules, O(K 2^(K-1)) for K binary variables, even if we limit ourselves to rules with positive propositions (e.g., A=1) on the left- and right-hand sides. Taking advantage of the nature of the score function can reduce the average run-time.

Reducing Average Search Run-Time
For the rule "if A=1 and B=1 then C=1 with probability p", p = p(C=1 | A=1, B=1) and p(A=1, B=1, C=1) is the support.
Observation: if either p(A=1) < p_s or p(B=1) < p_s, then p(A=1, B=1) < p_s. So first find all events (such as A=1) that have probability greater than p_s; these are the frequent sets of size 1. Then consider all possible pairs of these frequent events as candidate frequent sets of size 2.
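A sketch of that first step, reusing the `freq` counter from the earlier example (illustrative only):

```python
from itertools import combinations

def frequent_singletons(n_items, min_count):
    """Pass 1: keep each single item whose count meets the support threshold."""
    return [frozenset([i]) for i in range(n_items) if freq((i,)) >= min_count]

def candidate_pairs(singletons):
    """Every pair of frequent single items becomes a size-2 candidate."""
    return [a | b for a, b in combinations(singletons, 2)]
```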

Frequent Sets
Going from frequent sets of size k-1 to frequent sets of size k, we can prune any set of size k that contains a subset of k-1 items that is not frequent. E.g., the frequent sets {A=1, B=1} and {B=1, C=1} can be combined to get the k=3 set {A=1, B=1, C=1}; however, if {A=1, C=1} is not frequent, then {A=1, B=1, C=1} cannot be frequent either and can safely be pruned. Pruning can take place without searching the data directly. This is the a priori property.
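The prune step itself never touches the data; it only consults the frequent sets already found at the previous level. A minimal sketch:

```python
from itertools import combinations

def prune(candidates, frequent_prev):
    """Keep a size-k candidate only if all of its (k-1)-subsets are frequent.

    `frequent_prev` is the collection of frequent itemsets of size k-1,
    each represented as a frozenset of item indices.
    """
    frequent_prev = set(frequent_prev)
    return [c for c in candidates
            if all(frozenset(s) in frequent_prev
                   for s in combinations(c, len(c) - 1))]
```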

A Priori Algorithm Operation
Given a pruned list of candidate frequent sets of size k, the algorithm performs another linear scan of the database to determine which of these sets are in fact frequent. The confirmed frequent sets of size k are combined to generate possible frequent sets containing k+1 events, followed by another pruning, and so on. The cardinality of the largest frequent set is quite small (relative to n) for large support values. The algorithm makes one last pass through the data set to determine which subset combinations of the frequent sets also satisfy the accuracy threshold.
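Putting the pieces together, each level alternates candidate generation, a priori pruning, and one counting scan of the data. Below is a compact sketch of the level-wise loop, reusing `freq` and `prune` from above; it is illustrative, not an optimized implementation (in particular, a real implementation counts all candidates of one level in a single pass over the database).

```python
def apriori_frequent_sets(n_items, min_count):
    """Level-wise search: frequent sets of size k-1 generate candidates of size k."""
    # Pass over the data for k=1.
    frequent = [frozenset([i]) for i in range(n_items) if freq((i,)) >= min_count]
    all_frequent = list(frequent)
    k = 2
    while frequent:
        # Join: unions of size-(k-1) frequent sets that form a size-k set.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = prune(candidates, frequent)   # no data access needed
        # One linear scan of the database counts the surviving candidates.
        frequent = [c for c in candidates if freq(c) >= min_count]
        all_frequent.extend(frequent)
        k += 1
    return all_frequent
```

On the ten-basket example above, `apriori_frequent_sets(5, 4)` returns the same six frequent sets as the brute-force enumeration.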

Summary: Association Rule Algorithms
Search and data management are the most critical components. These algorithms use a systematic, breadth-first, general-to-specific search method that tries to minimize the number of linear scans through the database. Unlike machine learning algorithms for rule-based representations, they are designed to operate on very large data sets relatively efficiently. Papers on association rules tend to emphasize computational efficiency rather than interpretation of the rules produced.

Vector Space Algorithms for Text Retrieval
Retrieval by content: given a query object and a large database of objects, find the k objects in the database that are most similar to the query.

Text Retrieval Algorithm
How is similarity defined? Text documents are of different lengths and structure. Key idea: reduce all documents to a uniform vector representation as follows. Let t_1, ..., t_p be p terms (words, phrases, etc.); these are the variables, or columns, in the data matrix.

Vector Space Representation of Documents
A document (a row in the data matrix) is represented by a vector of length p, where the i-th component contains the count of how often term t_i appears in the document. In practice this can be a very large data matrix, with n in the millions and p in the tens of thousands, and it is sparse. So instead of storing a very large n x p matrix, store for each term t_i a list of all documents containing the term.
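A minimal sketch of both views, assuming documents arrive as already-tokenized lists of terms (the documents here are invented for illustration):

```python
from collections import Counter, defaultdict

docs = [
    ["data", "mining", "rules", "data"],
    ["text", "retrieval", "data"],
    ["mining", "text", "patterns"],
]

# Dense view: one term-count vector per document over a fixed term list.
terms = sorted({t for d in docs for t in d})
vectors = [[Counter(d)[t] for t in terms] for d in docs]

# Sparse view: for each term, the list of documents containing it.
inverted_index = defaultdict(list)
for doc_id, d in enumerate(docs):
    for t in sorted(set(d)):
        inverted_index[t].append(doc_id)

print(vectors[0])              # counts for document 0
print(inverted_index["data"])  # documents containing "data": [0, 1]
```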

Similarity of Documents
Similarity (distance) is a function of the angle between two vectors in p-space. The angle measures similarity in term space and factors out differences arising from the fact that large documents have more occurrences of a word than small documents. This works well, and there are many variations on the theme.
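The angle-based score is typically computed as the cosine of the angle between the two term-count vectors: 1 when the vectors point in the same direction, 0 when the documents share no terms. A sketch:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between term-count vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```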

Text Retrieval Algorithm Tuple
1. Task = retrieval of the k most similar documents in a database relative to a given query
2. Representation = vector of term occurrences
3. Score function = angle between two vectors
4. Search method = various techniques
5. Data management technique = various fast indexing strategies
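Combining the pieces, retrieval reduces to scoring the query vector against each document vector and keeping the top k. This brute-force sketch reuses `terms`, `vectors`, and `cosine_similarity` from above; a real system would use the inverted index to score only documents sharing terms with the query.

```python
from collections import Counter

def retrieve(query_vec, doc_vectors, k=2):
    """Return (doc_id, score) for the k documents most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v))
              for i, v in enumerate(doc_vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

query = [Counter(["data", "mining"])[t] for t in terms]
print(retrieve(query, vectors))   # best matches for the query "data mining"
```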

Variations of TR Components
In defining the score function, we can specify similarity metrics more general than the angle function. In specifying the search method, various heuristic techniques are possible. Search must happen in real time, since the algorithm has to retrieve patterns in real time for a user (unlike other data mining algorithms, which are meant for off-line search for optimal parameters and model structures).

Text Retrieval Variations
In searching legal documents, the absence of particular terms might be significant; the score function should reflect this. In another context, one might down-weight the fact that certain terms are missing in two documents relative to what they have in common.