Interesting Patterns. Jilles Vreeken. 15 May 2015

1 Interesting Patterns Jilles Vreeken 15 May 2015

2 Questions of the Day What is interestingness? What is a pattern? And how can we mine interesting patterns?

3 What is a pattern? Data Pattern y = x - 1

4 What is a pattern? Recurring structure Data Pattern

5 Pattern mining, formally. For a database $\mathit{db}$, a pattern language $P$, and a set of constraints $C$, the goal is to find the set of patterns $S \subseteq P$ such that each $p \in S$ satisfies each $c \in C$ on $\mathit{db}$, and $S$ is maximal. That is, find all patterns that satisfy the constraints.

6 Frequent Pattern Mining. Suppose a supermarket sells items $I$ and logs every transaction $t \subseteq I$ in a database $\mathit{db}$. An interesting question to ask is: what products are often sold together? Pattern language: all possible sets of items, $P = \mathcal{P}(I)$. Pattern: an itemset $X \subseteq I$, $X \in P$.

7 Frequent Itemsets. support( ) = 3 (the itemset was shown as a picture on the slide)

8 Frequent Conjunctive Formulas. Pattern: Petal length >= 2.… and Petal width <= … Data (sepal length, sepal width, petal length, petal width, class):
4.9, 3.1, 1.5, 0.1, Iris-setosa
5.0, 3.2, 1.2, 0.2, Iris-setosa
5.5, 3.5, 1.3, 0.2, Iris-setosa
4.9, 3.1, 1.5, 0.1, Iris-setosa
4.4, 3.0, 1.3, 0.2, Iris-setosa
…, 3.4, 1.5, 0.2, Iris-setosa
…, 3.5, 1.3, 0.3, Iris-setosa
4.5, 2.3, 1.3, 0.3, Iris-setosa
4.4, 3.2, 1.3, 0.2, Iris-setosa
5.0, 3.5, 1.6, 0.6, Iris-setosa
5.1, 3.8, 1.9, 0.4, Iris-setosa
4.8, 3.0, 1.4, 0.3, Iris-setosa
5.1, 3.8, 1.6, 0.2, Iris-setosa
4.6, 3.2, 1.4, 0.2, Iris-setosa
5.3, 3.7, 1.5, 0.2, Iris-setosa
5.0, 3.3, 1.4, 0.2, Iris-setosa
7.0, 3.2, 4.7, 1.4, Iris-versicolor
6.4, 3.2, 4.5, 1.5, Iris-versicolor
6.9, 3.1, 4.9, 1.5, Iris-versicolor
5.5, 2.3, 4.0, 1.3, Iris-versicolor
6.5, 2.8, 4.6, 1.5, Iris-versicolor
5.7, 2.8, 4.5, 1.3, Iris-versicolor
6.3, 3.3, 4.7, 1.6, Iris-versicolor
4.9, 2.4, 3.3, 1.0, Iris-versicolor
6.6, 2.9, 4.6, 1.3, Iris-versicolor

9 Frequent Subgraphs

10 The Frequent Pattern Problem. The task is to find all frequent patterns. How often is $X$ sold: $\mathit{supp}(X) = |\{ t \in \mathit{db} \mid X \subseteq t \}|$, the number of transactions in $\mathit{db}$ that support the pattern. Often enough: $\mathit{supp}(X) \ge \mathit{minsup}$, that is, the support meets the minimal-support threshold. So, the problem is to find all $X$ with $\mathit{supp}(X) \ge \mathit{minsup}$. How can we do this?
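The support definition above can be sketched in a few lines of Python. The six transactions below are an assumption: they are one database that reproduces the support counts shown on the itemset-lattice slides (with a at support 5), not the data figure from the slides themselves. The name `supp` is likewise chosen here for illustration.

```python
# Hypothetical toy database: one set of six transactions that is consistent
# with the supports on the lattice slides (a=5, b=4, c=3, d=3, ab=4, ...).
DB = [
    {"a", "b", "c", "d"},
    {"a", "b", "d"},
    {"a", "b", "d"},
    {"a", "b", "c"},
    {"a"},
    {"c"},
]


def supp(itemset, db):
    """Absolute support: the number of transactions t in db with itemset <= t."""
    items = set(itemset)
    return sum(1 for t in db if items <= t)


if __name__ == "__main__":
    for x in ["a", "ab", "abc", "abcd"]:
        print(x, supp(x, DB))
```

Passing a string such as `"ab"` works because `set("ab")` is the itemset `{a, b}`; for multi-character item names one would pass a set instead.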

11 Monotonicity. The number of possible patterns is exponential, and hence exhaustive search is not a feasible option. However, in 1994 it was discovered that support exhibits monotonicity. That is, for two itemsets $X$ and $Y$, we know $X \subseteq Y \Rightarrow \mathit{supp}(X) \ge \mathit{supp}(Y)$. This is known as the A Priori property. It allows efficient search for frequent itemsets over the lattice of all itemsets.

12 The Itemset Lattice. Panels: data (columns a, b, c, d) and itemset lattice. Lattice levels, with supports in parentheses:
abcd (1)
abc (2), abd (3), acd (1), bcd (1)
ab (4), ac (2), ad (3), bc (2), bd (3), cd (1)
a (5), b (4), c (3), d (3)
∅ (6)

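13 The Itemset Lattice. Panels: data (columns a, b, c, d) and itemset lattice, with the frequent itemsets marked. Lattice levels, with supports in parentheses:
abcd (1)
abc (2), abd (3), acd (1), bcd (1)
ab (4), ac (2), ad (3), bc (2), bd (3), cd (1)
a (5), b (4), c (3), d (3)
∅ (6)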

14 Levelwise search
1. $F_1 = \{ i \in I \mid \mathit{supp}(i) \ge \mathit{minsup} \}$
2. while $F_k$ not empty
3.     $C_{k+1} = \{ X \in P \mid |X| = k+1,\ \forall\, Y \subset X, |Y| = k : Y \in F_k \}$
4.     $F_{k+1} = \{ X \in C_{k+1} \mid \mathit{supp}(X) \ge \mathit{minsup} \}$
5. return $F_1 \cup F_2 \cup \dots$
The A Priori algorithm can be applied to mine patterns for any enumerable pattern language $P$ and any monotonic constraint $c$. Many algorithms exist that are more efficient, but none so versatile.
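The levelwise search above could be sketched for itemsets as follows. This is a minimal illustration, not an optimized A Priori implementation: the toy database is an assumed reconstruction that matches the lattice supports, the threshold of 3 is an arbitrary choice, and candidates are generated by naively joining pairs of frequent k-itemsets.

```python
from itertools import combinations

# Hypothetical toy database, consistent with the lattice supports on the slides.
DB = [
    {"a", "b", "c", "d"},
    {"a", "b", "d"},
    {"a", "b", "d"},
    {"a", "b", "c"},
    {"a"},
    {"c"},
]


def supp(itemset, db):
    """Number of transactions that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in db if items <= t)


def apriori(db, minsup):
    """Return {frequent itemset: support}, found level by level."""
    items = sorted(set().union(*db))
    # Step 1: frequent single items.
    level = {frozenset([i]) for i in items if supp([i], db) >= minsup}
    frequent = {}
    k = 1
    while level:  # Step 2
        for x in level:
            frequent[x] = supp(x, db)
        # Step 3: (k+1)-itemsets all of whose k-subsets are frequent.
        candidates = set()
        for x in level:
            for y in level:
                union = x | y
                if len(union) == k + 1 and all(
                    frozenset(sub) in level for sub in combinations(union, k)
                ):
                    candidates.add(union)
        # Step 4: keep the candidates that are actually frequent.
        level = {c for c in candidates if supp(c, db) >= minsup}
        k += 1
    return frequent  # Step 5: the union of all levels


frequent = apriori(DB, minsup=3)

if __name__ == "__main__":
    for x, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print("".join(sorted(x)), s)
```

The subset check in step 3 is where monotonicity pays off: a candidate is never counted against the data unless all of its k-subsets were already found frequent.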

15 Problems in pattern paradise. The pattern explosion: high thresholds give few, but well-known patterns; low thresholds give a gazillion patterns. Many patterns are redundant. Results are unstable: a small change in the data gives different results, even when the distribution did not really change.

16 The Wine Explosion. The Wine dataset has 178 rows and 14 columns.

17 To the Max! Why not just report only patterns for which there is no extension that is frequent? These patterns are called maximally frequent. Lattice (frequent itemsets marked on the slide):
abcd (1)
abc (2), abd (3), acd (1), bcd (1)
ab (4), ac (2), ad (3), bc (2), bd (3), cd (1)
a (5), b (4), c (3), d (3)
∅ (6)
(Bayardo, 1998)

18 Closure! Why throw away so much information? If we keep all $X$ that cannot be extended without $\mathit{supp}(X)$ dropping, all frequent itemsets and their frequencies can be reconstructed without loss! These are called closed frequent itemsets. Lattice (frequent itemsets marked on the slide):
abcd (1)
abc (2), abd (3), acd (1), bcd (1)
ab (4), ac (2), ad (3), bc (2), bd (3), cd (1)
a (5), b (4), c (3), d (3)
∅ (6)
(Pasquier, 1999)
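To make the difference between maximal and closed frequent itemsets concrete, here is a brute-force sketch over all subsets of a four-item alphabet. The database is an assumed reconstruction that matches the lattice supports, the threshold of 3 is an arbitrary choice, and the helper names are hypothetical. Enumerating every subset is only sensible for toy data like this.

```python
from itertools import combinations

# Hypothetical toy database, consistent with the lattice supports on the slides.
DB = [
    {"a", "b", "c", "d"},
    {"a", "b", "d"},
    {"a", "b", "d"},
    {"a", "b", "c"},
    {"a"},
    {"c"},
]


def supp(itemset, db):
    items = set(itemset)
    return sum(1 for t in db if items <= t)


def all_supports(db):
    """Support of every itemset over the items in db (exponential, toy use only)."""
    items = sorted(set().union(*db))
    return {
        frozenset(c): supp(c, db)
        for r in range(len(items) + 1)
        for c in combinations(items, r)
    }


def maximal_frequent(supports, minsup):
    """Frequent itemsets with no frequent proper superset."""
    freq = {x for x, s in supports.items() if s >= minsup}
    return {x for x in freq if not any(x < y for y in freq)}


def closed_frequent(supports, minsup):
    """Frequent itemsets with no proper superset of equal support."""
    return {
        x: s
        for x, s in supports.items()
        if s >= minsup and not any(x < y and supports[y] == s for y in supports)
    }


def supp_from_closed(x, closed):
    """Recover the support of a frequent itemset from the closed ones alone."""
    return max(s for y, s in closed.items() if x <= y)


SUPPORTS = all_supports(DB)
MAXIMAL = maximal_frequent(SUPPORTS, minsup=3)
CLOSED = closed_frequent(SUPPORTS, minsup=3)

if __name__ == "__main__":
    print("maximal:", sorted("".join(sorted(x)) for x in MAXIMAL))
    print("closed: ", sorted("".join(sorted(x)) for x in CLOSED))
```

The maximal itemsets tell us which itemsets are frequent but not how frequent; `supp_from_closed` shows that the closed itemsets additionally let us read off every frequent itemset's support as the largest support among its closed supersets.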

19 Non-Derivable Patterns. Through inclusion/exclusion, we can derive the support of $abc$. As $\mathit{supp}(bc) = \mathit{supp}(c) = 2$, we know $b$ and $c$ always co-occur. Then, knowing that $\mathit{supp}(ac) = 2$, we can derive $\mathit{supp}(abc) = 2$. Lattice (frequent and non-derivable itemsets marked on the slide):
abcd (1)
abc (2), abd (3), acd (1), bcd (1)
ab (4), ac (2), ad (3), bc (2), bd (3), cd (1)
a (5), b (4), c (3), d (3)
∅ (6)
(Calders & Goethals, 2003)

20 Margin-Closed. Who cares that we can reconstruct all frequencies exactly? Why not allow a little bit of slack and zap more patterns? That is the main idea of margin-closed frequent itemsets. Lattice (frequent itemsets marked on the slide):
abcd (1)
abc (2), abd (3), acd (1), bcd (1)
ab (4), ac (2), ad (3), bc (2), bd (3), cd (1)
a (5), b (4), c (3), d (3)
∅ (6)
(Moerchen et al, 2011)

21 Associations. Why is a frequent pattern $X$ interesting? Because it identifies associations between elements of $X$. Many people buy both [item image] and [item image]. What's going on? Many patients have active genes A, B and C. What's going on? Many molecules share this structure. What's going on? Okay, but does higher frequency mean more interesting?

22 Expectation. Frequency alone is deceiving, and leads to redundant results. Say that many, many people buy [item image]. Then every 'real' pattern, such as [itemset image], can be extended with [item image], and we will likely find that the extended pattern is also frequent. Do we want it to be reported? Not unless its support deviates strongly from our expectation.

23 What did you expect? What do we expect? How do we model this? How can we measure whether expectation and reality are different enough? Let's start simple. Let's assume all items are independent.

24 Independence! Under the assumption that all items $i \in I$ are independent, the expected frequency of an itemset $X$ is simply $\mathit{ind}(X) = \prod_{x \in X} \mathit{fr}(x)$, where we write $\mathit{fr}(x) = \mathit{supp}(x) / |\mathit{db}|$ for the frequency, the relative support, of an item $x \in X$ in our database. Item frequencies can easily be extracted from data, and can reasonably be expected to be known by your domain expert.

25 Bro, do you even lift? We want to identify patterns whose frequency in the data deviates strongly from our expectation. One way to measure this deviation is lift: $\mathit{lift}(X) = \mathit{fr}(X) / \mathit{ind}(X)$. Patterns with a lift higher than 1 are more frequent than expected. Those with lift lower than 1 are less frequent. In our data/lattice example, $\mathit{lift}(AB) = 1.2$ and $\mathit{lift}(ABC) = 1.83$. (IBM, 1996)

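26 Example: Lift. Data columns: A, B, C, D. $\mathit{fr}(AB) = 4/6$, $\mathit{ind}(AB) = \ldots = 0.55$, $\mathit{lift}(AB) = \ldots$; $\mathit{fr}(ABC) = 2/6$, $\mathit{ind}(ABC) = \ldots$, $\mathit{lift}(ABC) = \ldots = 1.83$. That is, according to $\mathit{lift}$, $ABC$ is more interesting than $AB$.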
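As a small illustration of the independence model and lift, the sketch below recomputes the AB example. It assumes the supports read off the lattice slides (a = 5, b = 4, ab = 4, out of 6 transactions); the dictionary and function names are hypothetical, and only the AB case is worked out.

```python
from math import prod

N = 6  # number of transactions in the toy data
# Absolute supports taken from the lattice slides (assumed to be the intended data).
SUPP = {"a": 5, "b": 4, "ab": 4}


def fr(x):
    """Relative support (frequency) of itemset x."""
    return SUPP[x] / N


def ind(x):
    """Expected frequency of x if all its items were independent."""
    return prod(fr(item) for item in x)


def lift(x):
    """Observed frequency divided by the frequency expected under independence."""
    return fr(x) / ind(x)


if __name__ == "__main__":
    print("fr(ab)   =", fr("ab"))
    print("ind(ab)  =", ind("ab"))
    print("lift(ab) =", lift("ab"))
```

Here `ind("ab")` is the product of the two item frequencies, 5/6 and 4/6, and the resulting lift agrees with the value of 1.2 quoted for AB.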

27 Lift Lift is ad hoc. Lift strongly over-estimates, or under-estimates how surprising the frequency of a pattern is. It is a bad interestingness measure. Somewhat more formally: Lift is ad hoc because it compares scores directly, it does not consider how likely scores are, and does not use a proper statistical test to determine how significant the deviation is.

28 Better Lift. The probability of a random transaction to support $X$ is $\mathit{ind}(X)$. Assume our dataset contains $N$ transactions, and let $Z$ be a random variable that states how many transactions support $X$. Then $p(Z = M)$ is the probability that the support of $X$ is $M$, and is given by the binomial distribution with $q = \mathit{ind}(X)$: $p(Z = M) = \binom{N}{M} q^{M} (1 - q)^{N - M}$. We can now calculate how likely it is to observe a support of $\mathit{supp}(X)$ or higher, and decide whether the p-value $p(Z \ge \mathit{supp}(X) \mid N)$ is significant (e.g. $< 0.05$).
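The binomial test described above could be sketched as follows, using only the standard library. The numbers plugged in for AB (support 4 out of N = 6, item supports 5 and 4) are taken from the lattice slides and are an assumption about the intended data; the function names are hypothetical.

```python
from math import comb


def binom_pmf(m, n, q):
    """p(Z = m) for Z ~ Binomial(n, q)."""
    return comb(n, m) * q**m * (1 - q) ** (n - m)


def p_value(support, n, q):
    """Upper tail p(Z >= support): chance of seeing at least this support."""
    return sum(binom_pmf(m, n, q) for m in range(support, n + 1))


N = 6
Q_AB = (5 / 6) * (4 / 6)  # ind(AB) under the independence model
P_AB = p_value(4, N, Q_AB)

if __name__ == "__main__":
    print("p(Z >= 4) for AB:", P_AB)
```

With only six transactions the tail probability for AB comes out at roughly 0.45, far above 0.05, so a lift above 1 does not by itself imply a significant deviation on data this small.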

29 Aside: using Surprisingness. While we're in the business of unrealistic assumptions, say we have a good way to calculate a p-value for $X$. What can we do with it? There are two main approaches: 1) mine all patterns up to a certain threshold, 2) mine the top-k most surprising patterns.

30 Too much of a good thing. Under the independence assumption, we compare $P(ABCD)$ against $P(A)\,P(B)\,P(C)\,P(D)$, and hence any deviation from total independence is gauged as interesting. Say that $ABC$ is the true pattern; then it will have a high lift, but so will any extension of it! In other words, all supersets of a pattern are also scored highly. Why? How can we avoid this? Which ones should we report?

31 Partitions. Webb proposed that we should report only those patterns for which the frequency is surprising with regard to all its partitions. That is, compare $P(ABC)$ against each of $P(A)\,P(B)\,P(C)$, $P(AB)\,P(C)$, $P(AC)\,P(B)$, and $P(A)\,P(BC)$. Sounds like a good idea! But, how many partitions are there? And, how do we test for surprisingness? When we consider a 2-partition we can use Fisher's exact test. Webb tests against the partition $X = Y \cup Z$ such that $Y \cap Z = \emptyset$ and $\mathit{fr}(Y)\,\mathit{fr}(Z)$ is closest to $\mathit{fr}(X)$. (Webb, 2010)
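The choice of 2-partition described above can be sketched by enumerating all ways to split an itemset into two non-empty parts and keeping the split whose expected frequency is closest to the observed one. The database is an assumed reconstruction consistent with the lattice supports, and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy database, consistent with the lattice supports on the slides.
DB = [
    {"a", "b", "c", "d"},
    {"a", "b", "d"},
    {"a", "b", "d"},
    {"a", "b", "c"},
    {"a"},
    {"c"},
]


def fr(itemset, db):
    """Relative support of itemset in db."""
    items = set(itemset)
    return sum(1 for t in db if items <= t) / len(db)


def two_partitions(x):
    """Yield every split of x into two non-empty, disjoint parts (Y, Z)."""
    items = sorted(x)
    first, rest = items[0], items[1:]
    # Fixing the first item in Y avoids listing each split twice.
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            y = frozenset((first,) + extra)
            z = frozenset(items) - y
            if z:
                yield y, z


def closest_partition(x, db):
    """The 2-partition whose product fr(Y) * fr(Z) is closest to fr(X)."""
    target = fr(x, db)
    return min(
        two_partitions(x),
        key=lambda yz: abs(fr(yz[0], db) * fr(yz[1], db) - target),
    )


if __name__ == "__main__":
    y, z = closest_partition("abc", DB)
    print("".join(sorted(y)), "|", "".join(sorted(z)))
```

An itemset of n items has $2^{n-1} - 1$ such 2-partitions, so the enumeration grows exponentially with pattern size. On the toy data the closest split of abc is ab against c, which is the pair that the next slide tests.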

32 Applying Fisher's Exact Test. Let's test $\mathit{fr}(ABC)$ vs. $\mathit{fr}(AB)\,\mathit{fr}(C)$. Contingency table for $AB$ and $C$, in general form:
            p1      ¬p1
  p2        a       b       a+b
  ¬p2       c       d       c+d
            a+c     b+d     a+b+c+d
$$p = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}, \qquad n = a+b+c+d$$
We get $p(AB, C) = 0.6$, meaning that $ABC$ is not interesting, yay! (Fisher 1922; Webb 2010; Hamalainen 2012; Webb & Vreeken 2014)
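The formula above can be checked numerically with a short sketch. The cell counts are an assumption: they are derived from the lattice supports (ab = 4, c = 3, abc = 2, six transactions), not read from the slide's table. Note that the function computes the probability of the observed table, as in the formula; a full one-sided test would also sum over more extreme tables.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of a 2x2 table with the given cell counts."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


# Cell counts for AB versus C, derived from the lattice supports (assumed data).
A_CELL = 2  # transactions with AB and C       (supp(abc))
B_CELL = 2  # transactions with AB, without C  (supp(ab) - supp(abc))
C_CELL = 1  # transactions without AB, with C  (supp(c) - supp(abc))
D_CELL = 1  # transactions with neither

P_AB_C = table_probability(A_CELL, B_CELL, C_CELL, D_CELL)

if __name__ == "__main__":
    print("p(AB, C) =", P_AB_C)
```

With these counts the result agrees with the 0.6 reported on the slide.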

33 More Elaborate Models. Although the math and stats get more and more complicated, the models and tests we saw so far really are straightforward. How can we infuse more background knowledge? We can test against Bayesian Networks (Jaroszewicz et al. 2004) or Maximum Entropy models (Wang et al 2006, Mampaey et al 2012, …). This goes (probably) too deep for today. Let's re-consider this in a few weeks.

34 Tiles. Only considering how often something occurs biases us towards patterns with low cardinality. Instead, we can consider how much of the data a pattern covers. That is, a pattern $X$ now consists of a row-set and a column-set, and is regarded as more interesting the larger $\mathit{area}(X) = |\mathit{rows}(X)| \cdot |\mathit{cols}(X)|$. (Geerts et al 2014)

35 Patterns: Large Tiles (figure axes: Genes, Conditions)

36 Example: Tiles

37 Mining Large Tiles. Sadly, $\mathit{area}$ is not (anti-)monotonic. Extending the column set of a given tile $X$ may result in either an increase or a decrease in $\mathit{area}$. How can we mine tiles efficiently? Through depth-first search, using branch-and-bound. If you keep the row-set maximal, you can keep track of the conditional support of all not-yet-included items. Assuming maximal correlation, you have an upper bound.

38 Mining Tilings A big pile of Tiles is as bad as a big pile of Frequent Itemsets. Way too many, way too redundant. Instead, we can also ask to find a set of tiles that together cover as many of the ones in the data as possible. This means we are doing set cover, which is well-known to be NP-hard, but for which the greedy algorithm is known to be the best possible polynomial time approximation algorithm.
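A greedy tiling in the spirit of the slide above might look like the sketch below. Everything concrete in it is an assumption for illustration: the three candidate tiles are hand-picked exact tiles from a reconstructed toy database (rows 0 to 3 being the transactions abcd, abd, abd, abc), and the budget `k` is a parameter introduced here, which makes this the maximum-coverage variant of greedy set cover.

```python
def cells(tile):
    """All (row, column) cells covered by a tile given as (rows, cols)."""
    rows, cols = tile
    return {(r, c) for r in rows for c in cols}


def greedy_tiling(tiles, k):
    """Pick up to k tiles, each time the one covering the most uncovered cells."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(tiles, key=lambda t: len(cells(t) - covered))
        gain = cells(best) - covered
        if not gain:  # nothing new can be covered
            break
        chosen.append(best)
        covered |= gain
    return chosen, covered


# Hypothetical candidate tiles (all-ones submatrices of the toy data).
TILE_ABD = (frozenset({0, 1, 2}), frozenset("abd"))
TILE_AB = (frozenset({0, 1, 2, 3}), frozenset("ab"))
TILE_ABC = (frozenset({0, 3}), frozenset("abc"))

CHOSEN, COVERED = greedy_tiling([TILE_ABD, TILE_AB, TILE_ABC], k=2)

if __name__ == "__main__":
    for rows, cols in CHOSEN:
        print(sorted(rows), "".join(sorted(cols)))
    print("cells covered:", len(COVERED))
```

The second pick shows why marginal gain matters: the tile over ab has the larger area, but most of it is already covered, so the smaller tile over abc adds more new cells.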

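39 Exact and Noisy Tiles. Let $\mathit{fr}(X)$ for a tile $X$ be the relative number of 1s in the tile: $\mathit{occs}(X) = \sum_{i \in \mathit{cols}(X)} \sum_{j \in \mathit{rows}(X)} I(D_{i,j} = 1)$ and $\mathit{fr}(X) = \dfrac{\mathit{occs}(X)}{|\mathit{cols}(X)| \cdot |\mathit{rows}(X)|}$. For $\mathit{fr}(X) = 1$ or $0$ we say the tile is exact. Otherwise, it is noisy. (Tatti & Vreeken 2011)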
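The density of a tile can be computed directly from the definition above. In this sketch the binary matrix is an assumed reconstruction of the toy data (columns a, b, c, d; one row per transaction), the matrix is indexed as `D[row][column]`, and the function names are hypothetical.

```python
# Hypothetical binary matrix: rows are transactions, columns are items a, b, c, d.
D = [
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]


def occs(rows, cols, data):
    """Number of 1s inside the tile spanned by rows x cols."""
    return sum(1 for r in rows for c in cols if data[r][c] == 1)


def tile_fr(rows, cols, data):
    """Relative number of 1s in the tile (its density)."""
    return occs(rows, cols, data) / (len(rows) * len(cols))


def is_exact(rows, cols, data):
    """A tile is exact if it is all 1s or all 0s, and noisy otherwise."""
    return tile_fr(rows, cols, data) in (0.0, 1.0)


if __name__ == "__main__":
    print(tile_fr([0, 1, 2], [0, 1, 3], D))        # tile over a, b, d
    print(tile_fr([0, 1, 2, 3], [0, 1, 2, 3], D))  # larger tile, some 0s inside
```

The first tile contains only 1s and is therefore exact; the second trades a few 0s for a much larger area, which is the kind of tile the noisy setting allows.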

40 Noisy Tiles. Boolean Matrix Factorisation (BMF) aims to find a low-rank decomposition of a given binary matrix $A$ into row and column factor matrices $B$ and $C$, such that $A \approx B \circ C$, where $\circ$ is the Boolean matrix product, i.e. $0 + 0 = 0$, $0 + 1 = 1$, $1 + 1 = 1$, and we want to minimize the error between $A$ and $B \circ C$. When restricted to exact tiles (factors), BMF and Tiling are equivalent. BMF, however, is more general, as it allows for errors. (Miettinen et al 2006, …)
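The Boolean product and the reconstruction error can be sketched in plain Python as follows. The matrix `A` is an assumed reconstruction of the toy data, and the rank-2 factors `B` and `C` are hand-picked for illustration (two tiles, over a, b and over a, b, d); no factorisation algorithm is run here, and the function names are hypothetical.

```python
def boolean_product(b, c):
    """Boolean matrix product: entry (i, j) is 1 if some k has b[i][k] = c[k][j] = 1."""
    inner = len(c)
    return [
        [int(any(b[i][k] and c[k][j] for k in range(inner))) for j in range(len(c[0]))]
        for i in range(len(b))
    ]


def error(a, r):
    """Number of cells in which a and its reconstruction r disagree."""
    return sum(x != y for row_a, row_r in zip(a, r) for x, y in zip(row_a, row_r))


# Hypothetical data matrix (rows: transactions; columns: items a, b, c, d).
A = [
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]
# Hand-picked factors: factor 0 is the tile rows 0-3 x {a, b},
# factor 1 is the tile rows 0-2 x {a, b, d}.
B = [[1, 1], [1, 1], [1, 1], [1, 0], [0, 0], [0, 0]]
C = [[1, 1, 0, 0], [1, 1, 0, 1]]

R = boolean_product(B, C)

if __name__ == "__main__":
    for row in R:
        print(row)
    print("error:", error(A, R))
```

Where the two factors overlap, the reconstruction stays at 1 rather than becoming 2, which is exactly the $1 + 1 = 1$ rule; the remaining disagreements with `A` are the 1s that neither tile covers.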

41 Conclusions. Patterns are a powerful concept that can give a lot of insight into how your data is locally distributed. Monotonic constraints allow for efficient mining: levelwise search always works; more elegant algorithms exist; it works equally well for other pattern types, such as itemsets, sequences, trees, streams, and low-entropy sets. Measuring interestingness is inherently difficult: frequency alone is a bad measure; independence models are too weak; stronger models are computationally expensive.

42 Thank you!
