9. Business Intelligence

Data Warehousing & Data Mining
Wolf-Tilo Balke, Silviu Homoceanu
Institut für Informationssysteme, Technische Universität Braunschweig
http://www.ifis.cs.tu-bs.de

9. Business Intelligence
9.1 Business Intelligence Overview
9.2 Principles of Data Mining

What is Business Intelligence (BI)?
- The process, technologies, and tools needed to turn data into information, information into knowledge, and knowledge into plans that drive profitable business action
- BI comprises data warehousing, business analytic tools, and content/knowledge management
- Typical BI applications:
  - Customer segmentation
  - Propensity to buy (a customer's disposition to buy)
  - Customer profitability
  - Fraud detection
  - Customer attrition (loss of customers)
  - Channel optimization (connecting with the customer)

Customer segmentation
- What market segments do my customers fall into, and what are their characteristics?
- Personalize customer relationships for higher customer satisfaction and retention

Propensity to buy
- Which customers are most likely to respond to my promotion?
- Target the right customers and increase campaign profitability by focusing on those most likely to buy
Customer profitability
- What is the lifetime profitability of my customer?
- Make individual business interaction decisions based on the overall profitability of customers

Fraud detection
- How can I tell which transactions are likely to be fraudulent?
  (If your wife has just proposed to increase your life insurance policy, you should probably order pizza for a while.)
- Quickly determine fraud and take immediate action to minimize damage

Customer attrition
- Which customers are at risk of leaving?
- Prevent the loss of high-value customers and let go of lower-value customers

Channel optimization
- What is the best channel to reach my customers in each segment?
- Interact with customers based on their preferences and your need to manage cost

Automated decision tools
- Rule-based systems that provide a solution, usually in one functional area, to a specific repetitive management problem in one industry
- E.g., automated loan approval, intelligent price setting

Business performance management (BPM)
- A framework for defining, implementing, and managing an enterprise's business strategy by linking objectives with factual measures: key performance indicators

Dashboards
- Provide a comprehensive visual view of corporate performance measures, trends, and exceptions from multiple business areas
- Allow executives to spot hot spots in seconds and explore the situation

9.2 Data Mining

What is data mining (knowledge discovery in databases)?
- Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases
9.2 Applications

Market analysis
- Targeted marketing / customer profiling
  - Find clusters of model customers who share the same characteristics: interests, income level, spending habits, etc.
  - Determine customer purchasing patterns over time
- Cross-market analysis
  - Associations/correlations between product sales
  - Prediction based on the association of information

Corporate analysis and risk management
- Finance planning and asset evaluation
  - Cash flow analysis and prediction
  - Trend analysis, time series, etc.
- Resource planning
  - Summarize and compare the resources and spending
- Competition
  - Monitor competitors and market directions
  - Group customers into classes and apply a class-based pricing procedure
  - Set pricing strategy in a highly competitive market

9.2 Data Mining

Architecture of DM systems (layered, top to bottom):
- Graphical user interface
- Pattern evaluation, supported by a knowledge base
- Data mining engine
- Database or data warehouse server, with filtering
- Data sources: databases and the data warehouse, populated via ETL

9.2 Data Mining Techniques

Association (correlation and causality)
- Multi-dimensional vs. single-dimensional association
  - age(X, "20..29") ∧ income(X, "20..29K") → buys(X, "PC") [support = 2%, confidence = 60%]
  - contains(T, "computer") → contains(T, "software") [1%, 75%]

Classification and prediction
- Finding models (functions) that describe and distinguish classes or concepts for future predictions
- Presentation: decision tree, classification rules, neural network
- Prediction: predict some unknown or missing numerical values

Cluster analysis
- The class label is unknown: group data to form new classes, e.g., advertising based on client groups
- Clustering principle: maximize the intra-class similarity and minimize the inter-class similarity

Outlier analysis
- Outlier: a data object that does not comply with the general behavior of the data
- Can be considered noise or an exception, but is quite useful in fraud detection and rare-events analysis

Association rule mining has the objective of finding all co-occurrence relationships (called associations) among data items
- Classical application: market basket data analysis, which aims to discover how items are purchased by customers in a supermarket
- E.g., Cheese → Wine [support = 10%, confidence = 80%] means that 10% of the customers buy cheese and wine together, and 80% of customers buying cheese also buy wine
Basic concepts of association rules
- Let I = {i1, i2, ..., im} be a set of items
- Let T = {t1, t2, ..., tn} be a set of transactions, where each transaction ti is a set of items such that ti ⊆ I
- An association rule is an implication of the form X → Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅

Association rule mining: market basket analysis example
- I: the set of all items sold in a store, e.g., i1 = Beef, i2 = Chicken, i3 = Cheese, ...
- T: the set of transactions, i.e., the contents of customers' baskets, e.g., t1: Beef, Chicken, Milk; t2: Beef, Cheese; t3: Cheese, Wine; t4: ...
- An association rule might be Beef, Chicken → Milk, where {Beef, Chicken} is X and {Milk} is Y

Rules can be weak or strong
- The strength of a rule is measured by its support and confidence
- The support of a rule X → Y is the percentage of transactions in T that contain X ∪ Y
  - It can be seen as an estimate of the probability Pr(X ∪ Y ⊆ t)
  - With n as the number of transactions in T, the support of the rule X → Y is:
    support = |{i : X ∪ Y ⊆ ti}| / n
- The confidence of a rule X → Y is the percentage of transactions in T containing X that also contain Y
  - It can be seen as an estimate of the conditional probability Pr(Y ⊆ t | X ⊆ t):
    confidence = |{i : X ∪ Y ⊆ ti}| / |{j : X ⊆ tj}|

How do we interpret support and confidence?
- If support is too low, the rule may occur just by chance; acting on a rule with low support may not be profitable, since it covers too few cases
- If confidence is too low, we cannot reliably predict Y from X
- The objective of mining association rules is to discover all rules in T that have support and confidence greater than minimum thresholds (minsup, minconf)!

Finding rules based on support and confidence thresholds
- Let minsup = 30% and minconf = 80%, with transactions:
  T1: Beef, Chicken, Milk
  T2: Beef, Cheese
  T3: Cheese, Boots
  T4: Beef, Chicken, Cheese
  T5: Beef, Chicken, Clothes, Cheese, Milk
  T6: Clothes, Chicken, Milk
  T7: Chicken, Milk, Clothes
- Chicken, Clothes → Milk is valid [sup = 3/7 (42.86%), conf = 3/3 (100%)]
- Clothes → Milk, Chicken is also valid, and there are more
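To make both measures concrete, here is a minimal Python sketch (our own illustration, not part of the lecture) that computes support and confidence over the seven example baskets above; all function and variable names are our own:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """Estimate of Pr(Y ⊆ t | X ⊆ t): support of X ∪ Y over support of X."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

# The seven baskets from the example
T = [{"Beef", "Chicken", "Milk"},
     {"Beef", "Cheese"},
     {"Cheese", "Boots"},
     {"Beef", "Chicken", "Cheese"},
     {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
     {"Clothes", "Chicken", "Milk"},
     {"Chicken", "Milk", "Clothes"}]

print(support({"Chicken", "Clothes", "Milk"}, T))       # 3/7 ≈ 0.4286
print(confidence({"Chicken", "Clothes"}, {"Milk"}, T))  # 3/3 = 1.0
```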
This is a rather simplistic view of shopping baskets
- Some important information is not considered, e.g., the quantity of each item purchased or the price paid

There are a large number of rule mining algorithms
- They use different strategies and data structures
- Their resulting sets of rules are, however, all the same

Approaches in association rule mining
- Apriori algorithm
- Mining with multiple minimum supports
- Mining class association rules

The best known mining algorithm is the Apriori algorithm
- Step 1: find all frequent itemsets (sets of items with support ≥ minsup)
- Step 2: use the frequent itemsets to generate rules

Step 1: frequent itemset generation
- The key is the apriori property (downward closure property): any subset of a frequent itemset is also a frequent itemset
- E.g., for minsup = 30% on the transactions above, {Chicken, Clothes, Milk} is frequent (it occurs in T5, T6, T7), and so are all of its subsets: {Chicken, Clothes}, {Chicken, Milk}, {Clothes, Milk}, {Chicken}, {Clothes}, {Milk}

Finding frequent items
- Find all 1-item frequent itemsets, then all 2-item frequent itemsets, etc.
- In each iteration k, only consider itemsets that contain some frequent (k-1)-itemset
- Optimization: the algorithm assumes that items are sorted in lexicographic order
  - The order is used throughout the algorithm in each itemset
  - {w[1], w[2], ..., w[k]} represents a k-itemset w consisting of the items w[1], w[2], ..., w[k], where w[1] < w[2] < ... < w[k] according to the lexicographic order

9.3 Finding frequent items
- Initial step: find the frequent itemsets of size 1: F1
- Generalization, k ≥ 2:
  - Ck = candidates of size k: those itemsets of size k that could be frequent, given Fk-1
  - Fk = those candidates that are actually frequent, Fk ⊆ Ck (requires one scan of the database)

Candidate generation uses Fk-1 as input and returns a superset (the candidates) of the set of all frequent k-itemsets. It has two steps:
- Join step: generate all possible candidate itemsets Ck of length k as Ik = join(Ak-1, Bk-1), where
  Ak-1 = {i1, i2, ..., ik-2, ik-1} and Bk-1 = {i1, i2, ..., ik-2, i'k-1} with ik-1 < i'k-1;
  then Ik = {i1, i2, ..., ik-2, ik-1, i'k-1}
- Prune step: remove those candidates in Ck that do not respect the downward closure property, i.e., that include a non-frequent (k-1)-subset
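The join and prune steps can be sketched in a few lines of Python (an illustration under our own naming, not the lecture's reference implementation). Itemsets are kept as sorted tuples so the lexicographic join condition becomes a simple prefix test; the F3 used here reproduces the example worked out just below:

```python
from itertools import combinations

def generate_candidates(F_prev, k):
    """Apriori candidate generation: join F_{k-1} with itself, then prune
    every candidate that has an infrequent (k-1)-subset."""
    frequent = set(F_prev)
    # Join step: merge two (k-1)-itemsets that share their first k-2 items
    joined = {a + (b[-1],)
              for a in F_prev for b in F_prev
              if a[:-1] == b[:-1] and a[-1] < b[-1]}
    # Prune step: enforce the downward closure property
    return {c for c in joined
            if all(s in frequent for s in combinations(c, k - 1))}

F3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
print(generate_candidates(F3, 4))   # {(1, 2, 3, 4)}
```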
Candidate generation example
- Let F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}
- Join step: try joining each pair of itemsets in F3 that share their first two items:
  - {1, 2, 3} ⋈ {1, 2, 4} = {1, 2, 3, 4}
  - {1, 3, 4} ⋈ {1, 3, 5} = {1, 3, 4, 5}
  - After the join, C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}
- Prune step:
  - {1, 2, 3, 4} is a good candidate: all of its 3-subsets {1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {2, 3, 4} belong to F3
  - {1, 3, 4, 5} is removed from C4: its subsets {1, 4, 5} and {3, 4, 5} do not belong to F3
  - After pruning, C4 = {{1, 2, 3, 4}}

Complete example: find the frequent itemsets for minsup = 0.5
  TID   Items
  T100  1, 3, 4
  T200  2, 3, 5
  T300  1, 2, 3, 5
  T400  2, 5
- First scan of T ({item}:count)
  - C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
  - F1: {1}:2, {2}:3, {3}:3, {5}:3; {4} has a support of 1/4 < 0.5, so it does not belong to the frequent items
- C2 = prune(join(F1))
  - join: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
  - prune: C2 = {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5} (all 1-subsets belong to F1)
- Second scan of T
  - C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
  - F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
- Join: {1,3} could only be joined with {1,4} or {1,5}, but they are not in F2; the only possible join in F2 is {2,3} with {2,5}, resulting in {2,3,5}
  - prune({2,3,5}): {2,3}, {2,5}, {3,5} all belong to F2, hence C3: {2,3,5}
- Third scan of T
  - {2,3,5}:2, so sup({2,3,5}) = 50% and the minsup condition is fulfilled; F3: {2,3,5}

9.3 Apriori Algorithm: Step 2
Step 2: generating rules from frequent itemsets
- Frequent itemsets are not the same as association rules
- One more step is needed to generate association rules: for each frequent itemset I and each proper nonempty subset X of I:
  - Let Y = I \ X; then X → Y is an association rule if confidence(X → Y) ≥ minconf, where
    support(X → Y) = |{i : X ∪ Y ⊆ ti}| / n = support(I)
    confidence(X → Y) = |{i : X ∪ Y ⊆ ti}| / |{j : X ⊆ tj}| = support(I) / support(X)

Rule generation example, minconf = 50%
- Suppose {2, 3, 5} is a frequent itemset with sup = 50%, as calculated in Step 1
- Its proper nonempty subsets are {2, 3}, {2, 5}, {3, 5}, {2}, {3}, {5}, with sup = 50%, 75%, 50%, 75%, 75%, 75% respectively
- These generate the following association rules:
  - 2,3 → 5, confidence = 100% (sup(I) = 50%; sup({2,3}) = 50%; 50/50 = 1)
  - 2,5 → 3, confidence = 67% (50/75)
  - 3,5 → 2, confidence = 100% (50/50)
  - 2 → 3,5, confidence = 67%
  - 3 → 2,5, confidence = 67%
  - 5 → 2,3, confidence = 67%
- All rules have a support of at least 50%
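Since all subset supports were recorded in Step 1, rule generation reduces to a lookup-only loop. The following sketch (our own illustration; the numbers reproduce the example above) generates all rules from one frequent itemset:

```python
from itertools import combinations

def rules_from_itemset(I, sup, minconf):
    """Generate X -> I\\X for every proper nonempty subset X of the
    frequent itemset I, using only supports recorded in Step 1."""
    I = frozenset(I)
    for r in range(1, len(I)):
        for X in map(frozenset, combinations(sorted(I), r)):
            conf = sup[I] / sup[X]          # support(I) / support(X)
            if conf >= minconf:
                yield sorted(X), sorted(I - X), sup[I], conf

# Supports recorded during itemset generation (n = 4 transactions)
sup = {frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({5}): 0.75,
       frozenset({2, 3}): 0.50, frozenset({2, 5}): 0.75,
       frozenset({3, 5}): 0.50, frozenset({2, 3, 5}): 0.50}

for X, Y, s, c in rules_from_itemset({2, 3, 5}, sup, minconf=0.5):
    print(X, "->", Y, f"[sup = {s:.0%}, conf = {c:.0%}]")
```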
9.3 Apriori Algorithm: Step 2
Rule generation, summary
- To compute the confidence of X → Y, we only need support(I) and support(X)
- All the required information for the confidence computation has already been recorded during itemset generation, so there is no need to read the transaction data again
- This step is not as time-consuming as frequent itemset generation

9.3 Apriori Algorithm
Apriori algorithm, summary
- If k is the size of the largest itemset, the algorithm makes at most k passes over the data (in practice, k is bounded, e.g., by 10)
- The mining exploits the sparseness of the data and high minsup and minconf thresholds
- A high minsup threshold makes it impossible to find rules involving rare items in the data; the solution is the mining with multiple minimum supports approach

Mining with multiple minimum supports
- A single minimum support assumes that all items in the data are of the same nature and/or have similar frequencies, which is incorrect
- In practice, some items appear very frequently in the data, while others rarely appear
  - E.g., in a supermarket, people buy cooking pans much less frequently than they buy bread and milk
- Rare item problem: if the frequencies of items vary significantly, we encounter two problems
  - If minsup is set too high, rules that involve rare items will not be found
  - To find rules that involve both frequent and rare items, minsup has to be set very low; this may cause a combinatorial explosion, because the frequent items will be associated with one another in all possible ways

Multiple minimum supports
- Each item can have its own minimum item support, i.e., different support requirements for different rules
- To prevent very frequent items and very rare items from appearing in the same itemset S, we introduce a support difference constraint φ:
  max_{i∈S} sup(i) − min_{i∈S} sup(i) ≤ φ, where 0 ≤ φ ≤ 1 is user specified

Minsup of a rule
- Let MIS(i) be the minimum item support (MIS) value of item i. The minsup of a rule R is the lowest MIS value among the items in the rule:
  a rule R: i1, i2, ..., ik → ik+1, ..., ir satisfies its minimum support if its actual support is ≥ min(MIS(i1), MIS(i2), ..., MIS(ir))
- E.g., with user-specified MIS values MIS(bread) = 2%, MIS(shoes) = 0.1%, MIS(clothes) = 0.2%:
  - clothes → bread [sup = 0.15%, conf = 70%] does not satisfy its minsup (0.15% < min(0.2%, 2%))
  - clothes → shoes [sup = 0.15%, conf = 70%] satisfies its minsup (0.15% ≥ min(0.2%, 0.1%))
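As a small illustration (our own, reusing the slide's MIS values in a Python dict), the minsup test for a rule is just a minimum over the MIS values of its items:

```python
MIS = {"bread": 0.02, "shoes": 0.001, "clothes": 0.002}

def satisfies_minsup(rule_items, actual_support):
    """A rule satisfies its minsup iff its support reaches the lowest
    MIS value among all items occurring in the rule."""
    return actual_support >= min(MIS[i] for i in rule_items)

print(satisfies_minsup({"clothes", "bread"}, 0.0015))   # False: min MIS = 0.2%
print(satisfies_minsup({"clothes", "shoes"}, 0.0015))   # True:  min MIS = 0.1%
```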
The downward closure property is no longer valid
- E.g., consider four items 1, 2, 3 and 4 in a database, with minimum item supports MIS(1) = 10%, MIS(2) = 20%, MIS(3) = 5%, MIS(4) = 6%
- {1, 2} has a support of 9% and is infrequent, since 9% < min(10%, 20%); but {1, 2, 3}, with a support of 7%, is frequent, because 7% ≥ min(10%, 20%, 5%)
- If applied, downward closure would eliminate {1, 2}, so that {1, 2, 3} is never evaluated

How do we solve the downward closure property problem?
- Sort all items in I according to their MIS values (make it a total order)
- The order is used throughout the algorithm in each itemset
- Each itemset w is of the form {w[1], w[2], ..., w[k]}, consisting of the items w[1], w[2], ..., w[k], where MIS(w[1]) ≤ MIS(w[2]) ≤ ... ≤ MIS(w[k])

Mining with multiple minimum supports is an extension of the Apriori algorithm
- Step 1: frequent itemset generation
  - Initial step: produce the seeds for generating candidate itemsets
  - Candidate generation for k = 2
  - Generalization for k > 2, where the pruning step differs from the Apriori algorithm
- Step 2: rule generation

Step 1: frequent itemset generation
E.g., I = {1, 2, 3, 4}, with given MIS(1) = 10%, MIS(2) = 20%, MIS(3) = 5%, MIS(4) = 6%, over n = 100 transactions:
- Initial step
  - Sort I according to the MIS value of each item; let M represent the sorted items: M = {3, 4, 1, 2}
  - Scan the data once to record the support count of each item, e.g., {3}:6, {4}:3, {1}:9 and {2}:25 out of 100 transactions
- Go through the items in M to find the first item i that meets MIS(i), and insert it into a list of seeds L
  - For each subsequent item j in M (after i), if sup(j) ≥ MIS(i), insert j into L
  - MIS(3) = 5% and sup({3}) = 6% ≥ MIS(3), so L = {3}
  - sup({4}) = 3% < MIS(3), so L remains {3}
  - sup({1}) = 9% ≥ MIS(3), so L = {3, 1}
  - sup({2}) = 25% ≥ MIS(3), so L = {3, 1, 2}
- Calculate F1 from L based on the MIS of each item in L
  - F1 = {{3}, {2}}, since sup({1}) = 9% < MIS(1)
- Why not eliminate {1} directly, i.e., why calculate L and not F1 directly? Because under multiple minimum supports the downward closure property no longer holds starting from F1: {1} may still be needed for later candidates
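The initial step just described (sorting by MIS, building the seed list L, deriving F1) can be sketched as follows; this is our own illustration with the example's numbers, not the lecture's reference code:

```python
# MIS values and observed supports (both in percent, n = 100 transactions)
MIS = {1: 10, 2: 20, 3: 5, 4: 6}
sup = {1: 9, 2: 25, 3: 6, 4: 3}

M = sorted(MIS, key=MIS.get)          # items sorted by MIS: [3, 4, 1, 2]

# Build the seed list L: the first item i with sup(i) >= MIS(i) opens it;
# every later item j enters if sup(j) >= MIS(i), the MIS of that first item
L = []
for item in M:
    if not L:
        if sup[item] >= MIS[item]:
            L.append(item)
    elif sup[item] >= MIS[L[0]]:
        L.append(item)

# F1 keeps only the seeds that meet their own MIS
F1 = [i for i in L if sup[i] >= MIS[i]]
print(L)    # [3, 1, 2]
print(F1)   # [3, 2]
```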
Step 1, candidate generation for k = 2; let φ = 10% (support difference constraint)

  Item     1   2   3   4
  MIS (%)  10  20  5   6
  SUP (%)  9   25  6   3        L = {3, 1, 2}

- Take each item (seed) from L in order; we use L and not F1 because of the invalidity of the downward closure property!
- Test the chosen item against its own MIS: sup({3}) ≥ MIS(3)?
  - If not, go to the next element in L
  - If true, e.g., sup({3}) = 6% ≥ MIS(3) = 5%, try to form a level-2 candidate together with each of the next items in L, e.g., {3, 1}, then {3, 2}
- {3, 1} is a candidate: sup({1}) ≥ MIS(3) and |sup({3}) − sup({1})| ≤ φ
  - sup({1}) = 9%; MIS(3) = 5%; sup({3}) = 6%; φ = 10%
  - 9% ≥ 5% and |6% − 9%| = 3% ≤ 10%, thus C2 = {{3, 1}}
- Now try {3, 2}: sup({2}) = 25% ≥ 5%, but |6% − 25%| = 19% > 10%, so this candidate is rejected due to the support difference constraint
- Pick the next seed from L, i.e., 1 (needed to try {1, 2}): sup({1}) < MIS(1), so we cannot use 1 as a seed!
- Candidate generation for k = 2 thus ends with C2 = {{3, 1}}
- Now read the transaction list and calculate the support of each itemset in C2; let's assume sup({3, 1}) = 6%, which is larger than min(MIS(3), MIS(1)) = 5%
- Thus F2 = {{3, 1}}
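A matching sketch of this level-2 candidate generation (again our own illustration; supports and MIS values in percent, taken from the example):

```python
def level2_candidates(L, sup, MIS, phi):
    """MS-Apriori candidate generation for k = 2: a seed l must meet its
    own MIS; a later partner h needs sup(h) >= MIS(l) and a support
    difference of at most phi."""
    C2 = []
    for i, l in enumerate(L):
        if sup[l] >= MIS[l]:
            for h in L[i + 1:]:
                if sup[h] >= MIS[l] and abs(sup[h] - sup[l]) <= phi:
                    C2.append((l, h))
    return C2

L = [3, 1, 2]                 # seed list from the initial step
MIS = {1: 10, 2: 20, 3: 5}    # percent
sup = {1: 9, 2: 25, 3: 6}     # percent
print(level2_candidates(L, sup, MIS, phi=10))   # [(3, 1)]
```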
Step 1, generalization for k > 2 uses Fk-1 as input and returns a superset (the candidates) of the set of all frequent k-itemsets. It has two steps:
- Join step: same as in the case k = 2:
  Ik = join(Ak-1, Bk-1), where Ak-1 = {i1, i2, ..., ik-2, ik-1} and Bk-1 = {i1, i2, ..., ik-2, i'k-1}, with ik-1 < i'k-1 (in MIS order) and |sup(ik-1) − sup(i'k-1)| ≤ φ; then Ik = {i1, i2, ..., ik-2, ik-1, i'k-1}
- Prune step: for each (k-1)-subset s of Ik, if s is not in Fk-1, then Ik can be removed from Ck (it is not a good candidate). There is, however, one exception to this rule: when s does not include the first item of Ik (the item with the lowest MIS)

Generalization example for k > 2: consider F3 = {{1, 2, 3}, {1, 2, 5}, {1, 3, 4}, {1, 3, 5}, {1, 4, 5}, {1, 4, 6}, {2, 3, 5}}
- After the join we obtain {1, 2, 3, 5}, {1, 3, 4, 5} and {1, 4, 5, 6} (here we do not consider the support difference constraint)
- After pruning we get C4 = {{1, 2, 3, 5}, {1, 3, 4, 5}}:
  - {1, 2, 3, 5} is OK
  - {1, 3, 4, 5} is not deleted, although {3, 4, 5} ∉ F3, because {3, 4, 5} does not contain the first item 1 and MIS(3) > MIS(1); if MIS(3) = MIS(1), it could be deleted
  - {1, 4, 5, 6} is deleted, because {1, 5, 6} ∉ F3

Step 2: rule generation
- The downward closure property is not valid any more; as a consequence, we may have frequent k-itemsets that contain non-frequent (k-1)-sub-itemsets
- For those non-frequent sub-itemsets, no support value has been recorded
- The problem arises when we form rules of the form A,B → C, where C is the rare item: MIS(C) = min(MIS(A), MIS(B), MIS(C))
  - conf(A,B → C) = sup({A,B,C}) / sup({A,B})
  - We have the frequency of {A,B,C} because it is frequent, but we may not have the frequency needed to compute sup({A,B}) if {A,B} is not frequent by itself
- This is called the head-item problem

Head-item problem, rule generation example
  Item      Bread  Clothes  Shoes
  MIS (%)   2      0.2      0.1

  Itemset                   SUP (%)
  {Clothes, Bread}          0.15
  {Shoes, Clothes, Bread}   0.12

- {Shoes, Clothes, Bread} is a frequent itemset, since min MIS = MIS(Shoes) = 0.1% ≤ sup({Shoes, Clothes, Bread}) = 0.12%
- However, {Clothes, Bread} is not frequent, since neither Clothes nor Bread can seed frequent itemsets (min(MIS(Clothes), MIS(Bread)) = 0.2% > 0.15%)
- So we may not be able to calculate the confidence of all rules depending on Shoes, i.e., the rules:
  - Clothes, Bread → Shoes
  - Clothes → Shoes, Bread
  - Bread → Shoes, Clothes
- In general: if some item on the right side of a rule has the minimum MIS (e.g., Shoes), we may not be able to calculate the confidence without reading the data again

Advantages of multiple minimum supports
- It is a more realistic model for practical applications
- The model enables us to find rare-item rules without producing a huge number of meaningless rules
- By setting the MIS values of some items to 100% (or more), we can exclude those items entirely

Mining Class Association Rules (CAR)
- Normal association rule mining has no target: it finds all possible rules that exist in the data, i.e., any item can appear as a consequent or as a condition of a rule
- However, in some applications the user is interested in specific targets
  - E.g., the user has a set of text documents from some known topics and wants to find out which words are associated or correlated with each topic

9.3 Class Association Rules
CAR example: a text document data set
  doc 1: Student, Teach, School        : Education
  doc 2: Student, School               : Education
  doc 3: Teach, School, City, Game     : Education
  doc 4: Baseball, Basketball          : Sport
  doc 5: Basketball, Player, Spectator : Sport
  doc 6: Baseball, Coach, Game, Team   : Sport
  doc 7: Basketball, Team, City, Game  : Sport
- Let minsup = 20% and minconf = 60%. Examples of class association rules:
  - Student, School → Education [sup = 2/7, conf = 2/2]
  - Game → Sport [sup = 2/7, conf = 2/3]

9.3 Class Association Rules
- In data warehousing, CARs can successfully be used to mine rules for profiling the buyers of a certain product, e.g., middle-aged, has children, university degree → owns house
- CARs can also be extended with multiple minimum supports
  - E.g., for a data set with two classes, Yes and No, we may accept weaker positive (Yes) class rules with a minimum support of 45% and only strong negative (No) class rules with a minimum support of 75%
  - By setting the minimum class support of a class to 100%, we can skip generating rules for that class
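For a toy data set like the one above, CAR mining can be illustrated by brute-force enumeration (our own sketch; for real data one would use an Apriori-style algorithm with per-class supports). It recovers the two example rules, among others:

```python
from itertools import combinations

docs = [({"Student", "Teach", "School"}, "Education"),
        ({"Student", "School"}, "Education"),
        ({"Teach", "School", "City", "Game"}, "Education"),
        ({"Baseball", "Basketball"}, "Sport"),
        ({"Basketball", "Player", "Spectator"}, "Sport"),
        ({"Baseball", "Coach", "Game", "Team"}, "Sport"),
        ({"Basketball", "Team", "City", "Game"}, "Sport")]

def mine_cars(data, minsup, minconf, max_len=2):
    """Enumerate class association rules X -> class by brute force
    (exponential in max_len, so only suitable for toy data sets)."""
    n = len(data)
    items = sorted(set().union(*(d for d, _ in data)))
    classes = sorted({c for _, c in data})
    for r in range(1, max_len + 1):
        for X in map(set, combinations(items, r)):
            covering = [c for d, c in data if X <= d]  # classes of docs containing X
            for cls in classes:
                hits = sum(c == cls for c in covering)
                if covering and hits / n >= minsup and hits / len(covering) >= minconf:
                    print(sorted(X), "->", cls,
                          f"[sup = {hits}/{n}, conf = {hits}/{len(covering)}]")

mine_cars(docs, minsup=0.2, minconf=0.6)
# Output includes: ['School', 'Student'] -> Education [sup = 2/7, conf = 2/2]
#                  ['Game'] -> Sport [sup = 2/7, conf = 2/3]
```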
Tools
- Open source projects: Weka, RapidMiner
- Commercial: Intelligent Miner (replaced by DB2 Data Warehouse Editions), PASW Modeler (developed by SPSS), Oracle Data Mining (ODM)

Demo: the Apriori algorithm on a CAR data set for profiling car buyers' preferences
- Class values: unacceptable, acceptable, good, very good
- And 6 attributes, among them:
  - Buying cost: vhigh, high, med, low
  - Maintenance costs: vhigh, high, med, low
- Results (from the tool's output):
  - The largest frequent itemsets comprise 3 items
  - The most powerful rules are simple rules: most people find 2-person cars unacceptable
  - Lower-confidence rules (62%): if a 4-seat car is found unacceptable, then it is because it is unsafe (rule 30)
- Open source projects also have their limits, e.g., a car accidents data set with 350,000 rows and 54 attributes

Summary
- Business Intelligence overview: customer segmentation, propensity to buy, customer profitability, attrition, etc.
- Data mining overview: extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
- Association rule mining: Apriori algorithm, support, confidence, downward closure property
- Multiple minimum supports solve the rare-item problem; head-item problem

Next lecture
- Data mining on time series data: trend and similarity search analysis, sequence patterns