1. Data summary and visualization



Summary statistics

# The UScereal data frame has 65 rows and 11 columns.
# The data come from the 1993 ASA Statistical Graphics Exposition,
# and are taken from the mandatory F&DA food label.
# The data have been normalized here to a portion of one American cup.
> library(MASS)
> data(UScereal)
> summary(UScereal)   # summary statistics for each variable

Values shown in the output include:

mfr:        G: 22, K: 21, N: 3, P: 9, Q: 5, R: 5
calories:   Min. 50.0, Median 134.3, Mean 149.4, Max. 440.0
fat:        Min. 0.000, Median 1.000, Mean 1.423, Max. 9.091
sodium:     Min. 50.0, 1st Qu. 180.0, Median 232.0, Mean 237.8, 3rd Qu. 290.0
carbo:      Min. 10.53, Median 18.67, Mean 19.97, Max. 68.00
sugars:     Min. 0.00, Median 12.00, Mean 10.05, Max. 20.90
shelf:      1st Qu. 1.000, Median 2.000, 3rd Qu. 3.000, Max. 3.000
potassium:  1st Qu. 45.0, Median 96.6, 3rd Qu. 220.0, Max. 969.7
vitamins:   100%: 5, enriched: 57, none: 3

> # correlation matrix between some variables
> cor(UScereal[c("calories", "protein", "fat", "fibre", "sugars")])
[Output: 5 x 5 correlation matrix with rows and columns calories, protein, fat, fibre, sugars]

Density visualization: histogram

> hist(UScereal[, "protein"], main = "UScereal data", xlab = "protein")
[Figure: histogram titled "UScereal data"; x-axis: protein, y-axis: Frequency]

Density visualization: kernel smoothing

> plot(density(UScereal[, "protein"], kernel = "gaussian"), main = "UScereal data",
+      xlab = "protein")
[Figure: kernel density estimate titled "UScereal data"; x-axis: protein, y-axis: Density]

Boxplot

> mfr <- UScereal$mfr
> boxplot(UScereal[mfr == "K", "protein"], UScereal[mfr == "G", "protein"],
+         names = c("Kelloggs", "General Mills"), xlab = "Manufacturer", ylab = "protein")
[Figure: side-by-side boxplots of protein for Kelloggs and General Mills; x-axis: Manufacturer, y-axis: protein]

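Quantile plot

The QQ plot displays the points $(z_{k/(n+1)}, x_{(k)})$, $k = 1, \dots, n$, where $x_{(k)}$ is the $k$-th smallest observation and $z_q$ is the $q$-th quantile of $N(0,1)$, that is, $\Phi(z_q) = q$, $0 < q < 1$.

> qqnorm(UScereal$calories)
[Figure: Normal Q-Q Plot; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]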
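The construction in the definition can be reproduced by hand. The following is a minimal sketch, assuming the plotting positions $k/(n+1)$ stated above; `qqnorm` itself uses the slightly different positions returned by `ppoints`, so the two plots agree only approximately. The sample `x` is hypothetical illustration data, not the cereal data.

```r
# Hypothetical sample; any numeric vector works, e.g. UScereal$calories
x <- c(110, 212, 100, 147, 135, 179, 440, 50, 260)

n <- length(x)
x_sorted <- sort(x)        # order statistics x_(1) <= ... <= x_(n)
q <- (1:n) / (n + 1)       # plotting positions k/(n+1)
z <- qnorm(q)              # z_q solves pnorm(z_q) = q

# One point (z_{k/(n+1)}, x_(k)) per observation
plot(z, x_sorted,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Normal Q-Q plot (manual construction)")
```

If the sample is close to normal, the points fall near a straight line; the single large value in this sample shows up as a point far above the line at the right end.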

Relations between two variables: scatterplot

> plot(UScereal$fat, UScereal$calories, xlab = "fat", ylab = "calories")
[Figure: scatterplot; x-axis: Fat, y-axis: Calories]

Relations between more than two variables: scatterplot matrix

> plot(UScereal[c("calories", "fat", "protein", "sugars", "fibre", "sodium")])
[Figure: scatterplot matrix with panels calories, fat, protein, sugars, fibre, sodium]

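Parallel plot

> library(lattice)   # parallelplot() was named parallel() in older lattice versions
> parallelplot(~ UScereal[, c("calories", "protein", "fat", "fibre")])
[Figure: parallel coordinates plot; axes from bottom to top: calories, protein, fat, fibre, each scaled from Min to Max]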

2. Association rules (Market basket analysis)

Market basket analysis

Association rules show the relationships between data items.

Typical example: a grocery store keeps a record of weekly transactions. Each record represents the items bought during one cash register transaction. The objective of market basket analysis is to determine which items are likely to be purchased together by a customer.

Example

Items: {Beer, Bread, Jelly, Milk, PeanutButter}

Transaction   Items
$t_1$         Bread, Jelly, PeanutButter
$t_2$         Bread, PeanutButter
$t_3$         Bread, Milk, PeanutButter
$t_4$         Beer, Bread
$t_5$         Beer, Milk

100% of the time that PeanutButter is purchased, so is Bread.
33.3% of the time that PeanutButter is purchased, Jelly is also purchased.
PeanutButter appears in 60% of all transactions.

Definitions

Given:
a set of items $I = \{I_1, \dots, I_m\}$;
a database of transactions $D = \{t_1, \dots, t_n\}$, where $t_i = \{I_{i1}, \dots, I_{ik}\}$ and $I_{ij} \in I$.

Association rule: let $X$ and $Y$ be two disjoint subsets (itemsets) of $I$. We say that $Y$ is associated with $X$ (and write $X \Rightarrow Y$) if the appearance of $X$ in a transaction usually implies that $Y$ occurs in that transaction too.

We identify the itemset $X$ with the event $\{X \text{ is purchased}\}$.

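Support and confidence

The support $s$ of an association rule $X \Rightarrow Y$ is the percentage of transactions in the database that contain $X \cup Y$ (inside $P$, $X$ and $Y$ are read as purchase events):

$$s(X \Rightarrow Y) = P(X \cap Y) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{t_i \supseteq (X \cup Y)\}.$$

The confidence (or strength) $\alpha$ of an association rule $X \Rightarrow Y$ is the ratio of the number of transactions that contain $X \cup Y$ to the number of transactions that contain $X$:

$$\alpha(X \Rightarrow Y) = P(Y \mid X) = \frac{P(X \cap Y)}{P(X)} = \frac{\sum_{i=1}^{n} \mathbf{1}\{t_i \supseteq (X \cup Y)\}}{\sum_{i=1}^{n} \mathbf{1}\{t_i \supseteq X\}}.$$

Problem: identify all rules with support at least $s_0$ and confidence at least $\alpha_0$.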

Support and confidence of some rules

X ⇒ Y                      s      α
Bread ⇒ PeanutButter       60%    75%
PeanutButter ⇒ Bread       60%    100%
Beer ⇒ Bread               20%    50%
PeanutButter ⇒ Jelly       20%    33.3%
Jelly ⇒ PeanutButter       20%    100%
Jelly ⇒ Milk               0%     0%
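The entries of this table can be checked directly against the definitions of support and confidence. The following is a minimal sketch on the five example transactions; the helper names `support` and `confidence` are illustrative choices, not functions from a package.

```r
# The five example transactions t1..t5
transactions <- list(
  c("Bread", "Jelly", "PeanutButter"),
  c("Bread", "PeanutButter"),
  c("Bread", "Milk", "PeanutButter"),
  c("Beer", "Bread"),
  c("Beer", "Milk")
)

# Support: fraction of transactions that contain every item in `items`
support <- function(items, trans = transactions) {
  mean(vapply(trans, function(t) all(items %in% t), logical(1)))
}

# Confidence of the rule X => Y: P(X and Y) / P(X)
confidence <- function(x, y, trans = transactions) {
  support(c(x, y), trans) / support(x, trans)
}

support(c("Bread", "PeanutButter"))     # support of Bread => PeanutButter
confidence("Bread", "PeanutButter")     # confidence of Bread => PeanutButter
confidence("PeanutButter", "Jelly")     # confidence of PeanutButter => Jelly
```

The three calls correspond to the 60%, 75% and 33.3% entries in the table.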

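Other measures of rule quality

Rules with high support and confidence may be obvious (not interesting).

Lift (interest):

$$\text{lift}(X \Rightarrow Y) = \frac{P(X \cap Y)}{P(X)P(Y)} = \frac{\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{t_i \supseteq (X \cup Y)\}}{\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{t_i \supseteq X\} \cdot \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{t_i \supseteq Y\}}$$

Rules with lift above 1 are interesting.

Conviction:

$$\text{conviction}(X \Rightarrow Y) = \frac{P(X)P(Y^c)}{P(X \cap Y^c)} = \frac{\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{t_i \supseteq X\} \cdot \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{t_i \not\supseteq Y\}}{\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{t_i \supseteq X,\ t_i \not\supseteq Y\}}$$

Conviction equals 1 if $X$ and $Y$ are not related. Rules that always hold have conviction $= \infty$.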

Lift and conviction of some rules

X ⇒ Y                      Lift    Conviction
Bread ⇒ PeanutButter       5/4     8/5
PeanutButter ⇒ Bread       5/4     ∞
Beer ⇒ Bread               5/8     2/5
PeanutButter ⇒ Jelly       5/3     6/5
Jelly ⇒ PeanutButter       5/3     ∞
Jelly ⇒ Milk               0       3/5
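Lift and conviction can be computed from the same indicator counts as support. The following is a minimal sketch on the five example transactions; the function names are illustrative, and conviction is returned as `Inf` when no transaction contains $X$ without $Y$.

```r
# The five example transactions t1..t5
transactions <- list(
  c("Bread", "Jelly", "PeanutButter"),
  c("Bread", "PeanutButter"),
  c("Bread", "Milk", "PeanutButter"),
  c("Beer", "Bread"),
  c("Beer", "Milk")
)

# TRUE for each transaction that contains every item in `items`
contains <- function(items) {
  vapply(transactions, function(t) all(items %in% t), logical(1))
}

# lift(X => Y) = P(X and Y) / (P(X) P(Y))
lift <- function(x, y) {
  mean(contains(x) & contains(y)) / (mean(contains(x)) * mean(contains(y)))
}

# conviction(X => Y) = P(X) P(not Y) / P(X and not Y)
conviction <- function(x, y) {
  in_x <- contains(x)
  not_y <- !contains(y)
  mean(in_x) * mean(not_y) / mean(in_x & not_y)   # division by 0 gives Inf
}

lift("Bread", "PeanutButter")         # 5/4
conviction("Bread", "PeanutButter")   # 8/5
conviction("PeanutButter", "Bread")   # Inf: the rule always holds
```

Counting the complement event directly, rather than subtracting two supports, keeps the zero in the denominator exact for rules that always hold.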

Mining rules from frequent itemsets

1. Find frequent itemsets (itemsets whose frequency of occurrence is above the threshold $s_0$).
2. Generate rules from the frequent itemsets.

Input: D (database), I (collection of all items), L (collection of all frequent itemsets), $s_0$, $\alpha_0$.
Output: R (association rules satisfying $s_0$ and $\alpha_0$).

R = ∅;
for each l ∈ L do
    for each x ⊂ l such that x ≠ ∅ do
        if support(l) / support(x) ≥ α_0 then
            R = R ∪ {x ⇒ (l \ x)};

Example

Assume $s_0 = 30\%$ and $\alpha_0 = 50\%$.

Frequent itemsets: L = {{Beer}, {Bread}, {Milk}, {PeanutButter}, {Bread, PeanutButter}}

For l = {Bread, PeanutButter} there are two nonempty proper subsets:

$$\frac{\text{support}(\{\text{Bread}, \text{PeanutButter}\})}{\text{support}(\{\text{Bread}\})} = \frac{60}{80} = 0.75 > 0.5$$

$$\frac{\text{support}(\{\text{Bread}, \text{PeanutButter}\})}{\text{support}(\{\text{PeanutButter}\})} = \frac{60}{60} = 1 > 0.5$$

Conclusion: PeanutButter ⇒ Bread and Bread ⇒ PeanutButter are valid association rules.
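The rule-generation loop can be sketched in R as follows. This is an illustration under stated assumptions, not a reference implementation: frequent itemsets are stored in a named vector `freq_support` whose names are comma-joined sorted item names (a representation chosen here for brevity), and the support values are those of the five example transactions.

```r
# Supports of the frequent itemsets from the example (s0 = 30%)
freq_support <- c("Beer" = 0.4, "Bread" = 0.8, "Milk" = 0.4,
                  "PeanutButter" = 0.6, "Bread,PeanutButter" = 0.6)
alpha0 <- 0.5

# Canonical name of an itemset: sorted items joined by commas
key <- function(items) paste(sort(items), collapse = ",")

generate_rules <- function(freq_support, alpha0) {
  rules <- character(0)
  for (l_key in names(freq_support)) {
    l <- strsplit(l_key, ",", fixed = TRUE)[[1]]
    if (length(l) < 2) next                  # no nonempty proper subset
    for (size in 1:(length(l) - 1)) {
      for (x in combn(l, size, simplify = FALSE)) {
        # confidence of x => (l \ x) is support(l) / support(x)
        conf <- freq_support[[l_key]] / freq_support[[key(x)]]
        if (conf >= alpha0) {
          rules <- c(rules, paste(key(x), "=>", key(setdiff(l, x))))
        }
      }
    }
  }
  rules
}

rules <- generate_rules(freq_support, alpha0)
rules
```

The lookup `freq_support[[key(x)]]` relies on every subset of a frequent itemset being frequent itself, so its support is always available in the table.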

Finding frequent itemsets: apriori algorithm

Frequent itemset property: any subset of a frequent itemset must be frequent.

Basic idea:
Look at candidate sets of size $i$.
Choose the frequent itemsets of size $i$.
Generate candidate itemsets of size $i + 1$ by joining (taking unions of) the frequent itemsets found in pass $i$.

Example: apriori algorithm

$s_0 = 30\%$, $\alpha_0 = 50\%$

Pass 1
Candidates: {Beer}, {Bread}, {Jelly}, {Milk}, {PeanutButter}
Frequent itemsets: {Beer}, {Bread}, {Milk}, {PeanutButter}

Pass 2
Candidates: {Beer, Bread}, {Beer, Milk}, {Beer, PeanutButter}, {Bread, Milk}, {Bread, PeanutButter}, {Milk, PeanutButter}
Frequent itemsets: {Bread, PeanutButter}
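The two passes of this example can be reproduced with a short level-wise search. The following is a simplified sketch, assuming the five example transactions: candidates of size $i + 1$ are formed as unions of pairs of frequent itemsets of size $i$, and the usual extra pruning step (discarding candidates that have an infrequent subset) is omitted for brevity. The name `apriori_sketch` is hypothetical.

```r
# The five example transactions t1..t5
transactions <- list(
  c("Bread", "Jelly", "PeanutButter"),
  c("Bread", "PeanutButter"),
  c("Bread", "Milk", "PeanutButter"),
  c("Beer", "Bread"),
  c("Beer", "Milk")
)

# Fraction of transactions that contain every item in `items`
support <- function(items) {
  mean(vapply(transactions, function(t) all(items %in% t), logical(1)))
}

apriori_sketch <- function(s0) {
  candidates <- as.list(sort(unique(unlist(transactions))))  # pass 1: single items
  frequent <- list()
  while (length(candidates) > 0) {
    keep <- Filter(function(cand) support(cand) >= s0, candidates)
    frequent <- c(frequent, keep)
    if (length(keep) < 2) break              # nothing left to join
    size <- length(keep[[1]]) + 1
    unions <- list()
    for (a in seq_along(keep)) {
      for (b in seq_along(keep)) {
        if (a < b) {
          u <- sort(union(keep[[a]], keep[[b]]))
          # keep only unions of the next size; the name removes duplicates
          if (length(u) == size) unions[[paste(u, collapse = ",")]] <- u
        }
      }
    }
    candidates <- unname(unions)
  }
  frequent
}

frequent <- apriori_sketch(0.3)
vapply(frequent, paste, character(1), collapse = ",")
```

With $s_0 = 30\%$ the search stops after pass 2, because only one frequent itemset of size 2 remains and no candidate of size 3 can be formed.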

Summary: efficiently finding frequent itemsets

Finding frequent itemsets is costly. If there are $m$ items, there may be up to $2^m - 1$ frequent itemsets.
Once all frequent itemsets are found, generating the association rules is straightforward.

Example: DVD movie purchases

Data:
> data <- read.table("dvddata.txt", header = TRUE)
> data
[Output: table of purchase indicators, one row per transaction, with columns Braveheart, Gladiator, Green.Mile, Harry.Potter1, Harry.Potter2, LOTR1, LOTR2, Patriot, Sixth.Sense]

Preparations
> nobs <- dim(data)[1]
> n <- dim(data)[2]
> namesvec <- colnames(data)
> namesvec
[1] "Braveheart"    "Gladiator"     "Green.Mile"    "Harry.Potter1"
[5] "Harry.Potter2" "LOTR1"         "LOTR2"         "Patriot"
[9] "Sixth.Sense"
>
> # thresholds for rules
> supthresh <- ...   # support threshold (value elided)
> conftresh <- ...   # confidence threshold (value elided)
> lifttresh <- 2     # lift threshold
>
> sup1 <- array(0, n)
> sup2 <- matrix(0, ncol = n, nrow = n, dimnames = list(namesvec, namesvec))

Calculating the chance of appearance P(X) for each movie
> for (i in 1:n) {
+   sup1[i] <- sum(data[, i]) / nobs }
> sup1
[Output: vector of 9 supports, one per movie]

Calculating the chance of appearance P(X,Y) for each pair of movies
> for (j in 1:n) {
+   if (sup1[j] >= supthresh) {
+     for (k in j:n) {
+       if (sup1[k] >= supthresh) {
+         sup2[j, k] <- data[, j] %*% data[, k]
+         sup2[k, j] <- sup2[j, k] } } } }
> sup2 <- sup2 / nobs
> sup2
[Output: 9 x 9 matrix of pairwise supports; rows and columns are the movies in namesvec]

Calculating the confidence matrix P(column | row)
> conf2 <- sup2 / c(sup1)
> conf2
[Output: 9 x 9 confidence matrix; rows and columns are the movies in namesvec]

Calculating the lift matrix
> tmp <- matrix(c(sup1), nrow = n, ncol = n, byrow = TRUE)
> tmp
[Output: 9 x 9 matrix in which every row equals sup1]
>
> lift2 <- conf2 / tmp
> lift2
[Output: 9 x 9 lift matrix; rows and columns are the movies in namesvec]

Extracting and printing rules
> rulesmat <- (sup2 >= supthresh) * (conf2 >= conftresh) * (lift2 >= lifttresh)
> rulesmat
[Output: 9 x 9 matrix of 0/1 rule indicators; rows and columns are the movies in namesvec]
> diag(rulesmat) <- 0
> rules <- NULL
> for (j in 1:n) {
+   if (sum(rulesmat[j, ]) > 0) {
+     rules <- c(rules, paste(namesvec[j], "->", namesvec[rulesmat[j, ] == 1], sep = ""))
+   }
+ }
> rules
[1] "Green.Mile->LOTR1" "LOTR1->Green.Mile" "LOTR1->LOTR2"
[4] "LOTR2->LOTR1"

If we set supthresh <- 0.1, then we find 12 rules:
> rules
 [1] "Green.Mile->Harry.Potter1"    "Green.Mile->LOTR1"
 [3] "Green.Mile->LOTR2"            "Harry.Potter1->Green.Mile"
 [5] "Harry.Potter1->Harry.Potter2" "Harry.Potter1->LOTR2"
 [7] "Harry.Potter2->Harry.Potter1" "LOTR1->Green.Mile"
 [9] "LOTR1->LOTR2"                 "LOTR2->Green.Mile"
[11] "LOTR2->Harry.Potter1"         "LOTR2->LOTR1"
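The pairwise computations in this example can also be written without explicit loops. The following is a sketch under stated assumptions: the 0/1 matrix `data` is made-up illustration data (it does not reproduce dvddata.txt), the item names A, B, C and the three threshold values are hypothetical, and, unlike the loop version, pair supports are computed for all items before the thresholds are applied.

```r
# Hypothetical 0/1 purchase matrix: rows are transactions, columns are items
data <- matrix(c(1, 1, 0,
                 1, 1, 0,
                 0, 0, 1,
                 1, 0, 1,
                 0, 0, 1),
               ncol = 3, byrow = TRUE,
               dimnames = list(NULL, c("A", "B", "C")))

# Hypothetical thresholds, chosen only for this illustration
supthresh <- 0.3
conftresh <- 0.6
lifttresh <- 1.5

nobs  <- nrow(data)
sup1  <- colMeans(data)                 # P(X) for each item
sup2  <- crossprod(data) / nobs         # sup2[j, k] = P(j and k)
conf2 <- sup2 / sup1                    # row i divided by sup1[i]: P(column | row)
lift2 <- sweep(conf2, 2, sup1, "/")     # column j divided by sup1[j]

rulesmat <- (sup2 >= supthresh) & (conf2 >= conftresh) & (lift2 >= lifttresh)
diag(rulesmat) <- FALSE                 # drop trivial rules X -> X

idx   <- which(rulesmat, arr.ind = TRUE)
rules <- paste0(rownames(rulesmat)[idx[, "row"]], "->",
                colnames(rulesmat)[idx[, "col"]])
rules
```

`crossprod(data)` computes all inner products between columns at once, which is what the double loop over `j` and `k` does one pair at a time.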
