CS-C Data Science Chapter 8: Discrete methods for analyzing large binary datasets

1 CS-C Data Science Chapter 8: Discrete methods for analyzing large binary datasets Jaakko Hollmén, Department of Computer Science

2 Rest of the course In the first part of the course most of the variables had real values. In the last two chapters we look at discrete patterns. Chapter 8: Discrete methods for analyzing large binary datasets (frequent itemsets and breadth-first search). Chapter 9: Searching the Web for relevant pages: random walks on the Web (the hubs-and-authorities algorithm). We are only able to take a cursory look at these important areas of research.

3 Contents of this chapter

4 Large binary datasets Many rows (variables), many columns (observations). 0/1 data: does this thing occur in this observation? The cell D(m, h) of the matrix (table) D tells us whether the phenomenon described by variable m is present in observation h: D(m, h) = 1 or D(m, h) = 0.
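A minimal Matlab sketch of such a matrix (the values and names here are illustrative assumptions, not data from the course):

    % Hypothetical 3-by-4 0/1 data matrix: rows = variables, columns = observations.
    D = logical([1 0 1 1
                 0 1 1 0
                 1 1 0 1]);
    D(2, 3)   % is the phenomenon of variable 2 present in observation 3? (1 = yes)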

5 Binary data example: Words appearing in documents Observation: document. Variable: word (in its basic form). Data: D(s, d) = 1 if the word s appeared at least once in document d. Figure: Word-document matrix from the computer exercises (Round 5). Black stripe = word s appears in document d, i.e. D(s, d) = 1. A corpus of 56 documents with 5098 different words, including two intentionally generated "copies".

6 Binary data example: Web pages Observation: Web page p. Variable: Web page s. D(s, p) = 1 if and only if page s links to page p.

7 Binary data example: Mammal populations in nature Observation: 50 km × 50 km square r. Variable: mammal species l. D(l, r) = 1 if and only if species l was observed in square r.

8 Binary data example: Molecules within larger molecules Observation: molecule m (typically large). Variable: molecule p (smaller). D(p, m) = 1 if and only if molecule p exists as part of molecule m.

9 NSF 0-1 document example: Source data Abstract database: abstracts describing scientific projects funded by NSF (nsfawards.html).

10 NSF 0-1 data representation Text documents (abstracts) transformed into a 0/1 tabular format: 1 = the word appears at least once. Columns represent documents, rows represent words. Example words with indices: ..., 73:abscissa, 74:absence, 75:absent, 76:absolute, 77:absolutely, 78:absorb, 79:absorbed, ... (The slide also shows a fragment of the table with document columns #1, #2, ... and rows such as additional, genetic, studies; its 0/1 entries were lost in transcription.)

11 NSF 0-1 document example: Why this representation? From the point of view of information search, the list of words contained in a document is often enough. Fixed-format data is easier to process than variable-length strings.
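A minimal Matlab sketch of the transformation (the three toy documents and the variable names are assumptions, not the NSF pipeline):

    docs = {'the cat sat', 'the dog sat on the cat', 'dogs bark'};
    toks = cellfun(@(s) unique(strsplit(lower(s))), docs, 'UniformOutput', false);
    vocab = unique([toks{:}]);                % rows: distinct words in their basic form
    D = false(numel(vocab), numel(docs));     % columns: documents
    for d = 1:numel(docs)
        D(:, d) = ismember(vocab, toks{d});   % 1 = the word appears at least once
    end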

12 NSF 0-1 document example: Basic statistics About 128,000 documents and 30,800 words: not a very large set of documents. The table has about 128,000 × 30,800 ≈ 3.9 × 10^9 cells. Only about 0.265% of the cells (roughly 10 million) are ones, so there is no point in representing the zeros. There are approximately 81 different words in each document (1 = the word appears at least once). Each word appears in about 340 documents. The distribution is skewed: some words appear often, others very rarely. Some very common words have been omitted (a blacklist).

13 NSF 0-1 document example: Number of different words by document 1/3 Figure: X axis: document index number (about 128,000 documents in total). Y axis: number of different words in the document.

14 Number of different words by document 2/3 Figure: X axis: document index number, with documents arranged by number of different words. Y axis: number of different words in the document.

15 Number of different words by document 3/3 Figure: Number of different words per document as a histogram. Document length varies mainly within a fairly narrow range of different words.

16 NSF 0-1 document example: How many documents contain a certain word? 1/4 Figure: X axis: word index number (1, ..., 30800). Y axis: number of occurrences (1, ..., 80,000).

17 How many documents contain a certain word? 2/4 Figure: X axis: word index number (1, ..., 30800), with words arranged by frequency of occurrence. Y axis: number of occurrences (1, ..., 80,000).

18 How many documents contain a certain word? 3/4 Figure: X axis: word index number (1, ..., 30800), with words arranged by frequency of occurrence. Y axis: base-2 logarithm of the number of occurrences. Why is the graph between word indices 4000 and 16000 almost a straight line?

19 How many documents contain a certain word? 4/4 Figure: Histogram of word occurrences.

20 Binary dataset as a network Network (or graph): a set of nodes (partially) connected by edges. A binary dataset is easy to turn into a bipartite network: variables and observations are nodes, and an edge connects a variable to an observation if and only if the corresponding data value is 1.
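A minimal Matlab sketch of this construction (the matrix is the shopping-bag example of slide 24; the variable names are assumptions):

    % Nodes 1..d are variables, nodes d+1..d+n are observations.
    X = logical([1 0 1 1; 1 1 1 0; 0 0 1 1]);
    [d, n] = size(X);
    A = [zeros(d) X; X' zeros(n)];   % symmetric block adjacency matrix
    G = graph(A);                    % undirected bipartite network (R2015b or newer)
    plot(G)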

21 What could we look for in data like this? What does the data contain? What kinds of groups do the variables and observations form? How do the variables depend on each other? Etc. Analysis of variable and observation degrees (how many edges does this variable have, i.e. how many observations involve this variable): power laws, a very active area of research (see the sketch below). How do we look for small interesting sets of variables?
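Degrees come straight from row and column sums; a small sketch (the random stand-in matrix is an assumption and will not actually show a power law):

    X = rand(1000, 5000) < 0.01;                 % stand-in 0/1 data: 1000 variables, 5000 observations
    var_degree = sum(X, 2);                      % per variable: number of observations it occurs in
    obs_degree = sum(X, 1);                      % per observation: number of variables it involves
    loglog(sort(var_degree, 'descend'), '.')     % a power law would appear as a roughly straight line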

22 Searching for commonly appearing combinations of variables: Notation The data is a d × n 0/1 matrix:

    X = [ x_11 x_12 ... x_1n
          x_21 x_22 ... x_2n
          ...
          x_d1 x_d2 ... x_dn ]

A d-dimensional observation vector x has cells x_1, x_2, ..., x_d. There are many observation vectors x(1), ..., x(n). The value of variable i in observation vector x(j) is denoted by x(i, j) or x_ij.

23 Frequent itemsets: Definition If there are e.g. tens of thousands of variables, it's not possible to search for all pairwise correlations. How about variables commonly appearing together? Search for all variable pairs (a, b) such that there are at least N observations x(i) where x(a, i) = 1 and x(b, i) = 1. Generally: search for all variable sets {a_1, a_2, ..., a_k} such that there are at least N observations x(i) where x(a_1, i) = 1 and x(a_2, i) = 1 and ... and x(a_k, i) = 1. Let's call such a variable set {a_1, a_2, ..., a_k} a frequent itemset.

24 Frequent itemsets: Example We have 4 shopping bags, each of which either contains or doesn't contain oranges (a), bananas (b), and apples (c). Bag #1: 5 oranges, 2 bananas. Bag #2: 4 bananas. Bag #3: 1 orange, 2 bananas, 1 apple. Bag #4: 8 oranges, 2 apples. 1 = bag contains item, 0 = bag does not contain item (rows a, b, c; columns bags #1-#4):

    X = [ 1 0 1 1
          1 1 1 0
          0 0 1 1 ]

With threshold value N = 2, the frequent itemsets are {a}, {b}, {c}, {a, b} and {a, c}. For example, oranges (a) and apples (c) appear together ({a, c}) in at least two bags (#3 and #4).
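Support counts can be read straight off X; a small check of the example in Matlab (variable names are assumptions):

    X = logical([1 0 1 1; 1 1 1 0; 0 0 1 1]);   % rows a, b, c; columns bags #1-#4
    sum(X(1, :) & X(3, :))                      % support of {a, c}: 2 bags, frequent for N = 2
    sum(all(X([1 2 3], :), 1))                  % support of {a, b, c}: only bag #3, not frequent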

25 How many frequent itemsets are there? If the data only contained ones, all pairs and subsets of variables would fulfill the criterion. In order for the problem to be sensible, the data has to be sparse. This is typically the case: in a market-basket setting both shopping bags and products number in the thousands, but a customer only buys a small number of products at one time. The sparse matrix has far more zeros than ones. Using machine learning terminology: we look for commonly occurring positive conjunctions (a = 1 ∧ b = 1).

26 How do we search for frequent itemsets? Trivial approach: brute-force search through all subsets of variables. d variables give 2^d subsets; if there are e.g. 100 variables, there are way too many (2^100) subsets. We have to be more clever. Basic observation: if a set of variables {a_1, a_2, ..., a_k} is frequent, all its subsets are frequent. Conversely: a set of variables {a_1, a_2, ..., a_k} can be frequent only if all of its subsets are frequent. The frequentness of the subsets is a necessary but not a sufficient condition.

27 Example of frequent itemset search 1/3 Let's assume the variable sets {a}, {b}, {c}, {e}, {f}, {g} (set size 1) occur often enough (we have at least N observations of each). Then the pair {i, j} is a candidate frequent itemset for any i, j ∈ {a, b, c, e, f, g}; there are C(6, 2) = 6!/(4! 2!) = 15 candidate pairs. Let's assume that only the pairs {a, b}, {a, c}, {a, e}, {a, f}, {b, c}, {b, e}, {c, g} are frequent (size 2), i.e. they occur together at least N times. Let's combine these pairs into sets of size 3. The next step is to remove those sets which include non-frequent sets of size 2; what remains are the candidate frequent itemsets of size 3. For instance, the frequent pairs {a, b} and {a, f} can be combined into {a, b, f}, but the subset {b, f} is not frequent, so {a, b, f} is not accepted as a candidate.

28 Example of frequent itemset search 2/3 Then {a, b, c} and {a, b, e} are the only possible candidate frequent itemsets of size 3, because all their subsets are frequent. Let's assume that {a, b, c} is a frequent itemset (of size 3). We also already know that there can't be a frequent itemset of size 4 (or larger): the subsets of {a, b, c, e} include e.g. {a, c, e}, which is not frequent. In other words, the non-frequent set {a, c, e} can't be turned into a frequent itemset by adding variables. As we now know that candidates of size 4 can't exist, the search ends.

29 Example of frequent itemset search 3/3 The final step is to list all frequent itemsets C_0 ∪ C_1 ∪ C_2 ∪ C_3: {{}, {a}, {b}, {c}, {e}, {f}, {g}, {a, b}, {a, c}, {a, e}, {a, f}, {b, c}, {b, e}, {c, g}, {a, b, c}}. Here "assume" means "has been computed from the data"; in this example we did not have the actual data (matrix X).

30 Breadth-first search: Principle Breadth-first search (a.k.a. the levelwise algorithm or the a priori algorithm) is a simple but effective algorithm for frequent itemset search. First we look for frequent itemsets of size 1, then of size 2, etc. The basic observation we made earlier becomes very useful: a set can be frequent only if all its subsets are frequent.

31 Breadth-first search: Frequent itemsets of size 1 Frequent itemsets of size 1: variables a that have at least N ones in the dataset, i.e. there are at least N observations (columns) i for which x(a, i) = 1. These are easy to find by going through the data once and summing the occurrences of each variable.

32 Breadth-first search: Frequent itemsets of size 2 When we know the frequent itemsets of size 1, we form candidate sets of size 2. The set {a, b} is a candidate set when both {a} and {b} are frequent. Let's go through the data again, noting for each candidate set how many observations have a 1 for both variables a and b. Now we have found the frequent itemsets of size 2.

33 Breadth-first search: Frequent itemsets of size k 1/2 Let's assume that we know all the frequent itemsets of size k; let's call this collection C_k. Let's form candidate sets of size k + 1: sets that could be frequent. A candidate set is a set of k + 1 variables all of whose subsets are frequent; it is enough to check that all subsets of size k are frequent. Let's take all pairs of sets A = {a_1, ..., a_k} and B = {b_1, ..., b_k} from collection C_k, compute their union A ∪ B, check that the union has k + 1 elements and that all its subsets of size k are frequent (i.e. are in the collection C_k).

34 Breadth-first search: Frequent itemsets of size k 2/2 When we have found the candidate sets of size k + 1, we go through the data and record for each candidate set the number of columns where all the candidate set variables have the value 1. Now we have found the frequent itemsets of size k + 1. The algorithm is repeated until no new candidate sets can be found. A sketch of the whole algorithm is given below.
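A minimal Matlab sketch of the breadth-first search, assuming X is a d-by-n logical 0/1 matrix (rows = variables, columns = observations) and representing each itemset as a sorted row vector of variable indices. This illustrates the principle; it is not the course's reference implementation:

    function F = apriori_sketch(X, N)
        % Size-1 candidates: variables with at least N ones (one pass over the data).
        C = num2cell(find(sum(X, 2) >= N)');
        F = {};
        while ~isempty(C)
            % Count each candidate's support and keep the frequent ones.
            keep = cellfun(@(s) sum(all(X(s, :), 1)) >= N, C);
            Ck = C(keep);
            F = [F, Ck];
            % Combine frequent k-sets pairwise into (k+1)-candidates, pruning
            % every union that has a non-frequent k-subset (monotonicity).
            C = {};
            for i = 1:numel(Ck)
                for j = i+1:numel(Ck)
                    U = union(Ck{i}, Ck{j});
                    if numel(U) ~= numel(Ck{i}) + 1
                        continue
                    end
                    subs = nchoosek(U, numel(U) - 1);   % all k-subsets, one per row
                    ok = true;
                    for r = 1:size(subs, 1)
                        if ~any(cellfun(@(s) isequal(s, subs(r, :)), Ck))
                            ok = false;
                            break
                        end
                    end
                    if ok && ~any(cellfun(@(s) isequal(s, U), C))
                        C{end+1} = U;   % new candidate of size k+1
                    end
                end
            end
        end
    end

With the shopping-bag matrix of slide 24 and N = 2, apriori_sketch(logical([1 0 1 1; 1 1 1 0; 0 0 1 1]), 2) returns the index sets 1, 2, 3, [1 2] and [1 3], i.e. {a}, {b}, {c}, {a, b} and {a, c}.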

35 Breadth-first search: Available implementations Let's keep on working with the NSF data. Many implementations of frequent itemset search are available on the Web. Computer exercise TODO uses the apriori program by Bart Goethals to look for common products in cash register receipts and common words in coursework documents. Exercise TODO explores the complexity of computing frequent itemsets.

36 Frequent itemset search with a high threshold N Using a high threshold value N we find a total of 250 frequent itemsets among the variables (words) of the corpus of documents. There are 9 different three-word sets that appear in at least N documents. (The slide's table of candidate sets and frequent itemsets by set size, and the value of N, were lost in transcription.)

37 NSF example: Frequent itemset search with threshold N = 2000 Using the smaller threshold value N = 2000 we find more frequent itemsets and a much greater number of candidate sets. There is one six-word set appearing in at least 2000 documents. (The slide's table of candidate sets and frequent itemsets by set size was lost in transcription.)

38 NSF example: Number of frequent itemsets as a function of threshold N Threshold N: there have to be at least N columns with the value 1. (The slide's table of frequent itemset counts for different thresholds N was lost in transcription.)

39 NSF example: Frequent itemset of size 6 with threshold N = 2000: program, project, research, science, students, university.

40 What does the frequentness of a set of variables mean? If a set of variables occurs often, can this be due to chance? Let a and b be variables. In the data there are n_a observations where a occurs and n_b observations where b occurs; n observations in total. p(a = 1): probability that a random observation has a = 1; p(a = 1) = n_a/n and p(b = 1) = n_b/n. If we assume a and b to be independent, in how many observations do they occur together? p(a = 1 ∧ b = 1) = p(a = 1) p(b = 1), i.e. n_ab/n ≈ (n_a/n)(n_b/n), so n_ab ≈ n_a n_b / n.

41 NSF example of independence a = "computational": n_a = 7803. b = "algorithms": n_b = 6812. Independence: computational and algorithms would occur in a random column with probability (n_a/n)(n_b/n) = (7803/128000)(6812/128000) ≈ 0.0032. The expected number of such columns is n (n_a/n)(n_b/n) = n_a n_b / n ≈ 415. There are n_ab = 1974 such columns in the data. Is this significant? (We'll get back to this shortly.)
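The same arithmetic in Matlab (n = 128000 is inferred from the expected value of 415 quoted on the slide):

    n = 128000; na = 7803; nb = 6812; nab = 1974;
    expected = na * nb / n      % ≈ 415 co-occurring columns under independence
    ratio = nab / expected      % observed / expected ≈ 4.8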

42 Looking for pairs occurring more often than random with awk Input: word and word pair frequencies in NSF data. Usage: cat bag-of-words.txt | ./this.awk

    #!/usr/bin/awk -f
    BEGIN { n = 128000 }                # total number of documents in the NSF data
    {
      if (NF == 2) { f[$1] = $2 }       # input line: word, frequency
      if (NF == 3) { p[$1, $2] = $3 }   # input line: word, word, pair frequency
    }
    END {
      for (u in f)
        for (v in f)
          if (p[u, v] > 0)
            print u, v, f[u], f[v], p[u, v], p[u, v] / (f[u] * f[v] / n)
    }

43 Rough analysis of pairwise co-occurrence Ratio between observed and random co-occurrence of word pairs: p(a = 1 ∧ b = 1) / (p(a = 1) p(b = 1)) = (n_ab/n) / ((n_a/n)(n_b/n)) = n n_ab / (n_a n_b). Columns: (1) word pair (a, b), (2) n_a, (3) n_b, (4) n_ab, (5) the ratio. Examples of pairs with a large ratio (the numeric columns were lost in transcription): fellowship postdoctoral; film thin; dissertation doctoral; business innovation; polymer polymers; microscopy scanning.

44 NSF example of independence: Lots of variable pairs Some examples of variable pairs (the ratio values were lost in transcription): Pair #100: held, workshop. Pair #500: gene, identify. Pair #1000: engineering, manufacturing.

45 Significance of frequent itemsets Let's think of data generation as a repeated experiment (binomial distribution). Let's make n observations (NSF data: n = 128000). An observation is a success when both a = 1 and b = 1; this happens with probability p = n_a n_b / n^2. The number of successes is binomially distributed, Bin(n, p), with parameters n and p = n_a n_b / n^2. What is the probability that we succeed at least n_ab times? How small is the probability of getting a result so far from the expected value? If this probability is small, then there is something interesting in the found pattern.

46 Significance of frequent itemsets Estimating the probability of a binomial distribution tail: sum over i = n_ab + 1, ..., n of B(n, i, p), where B(n, i, p) is the probability that in an experiment with n repetitions we get exactly i successes when the probability of a single success is p: B(n, i, p) = C(n, i) p^i (1 - p)^(n - i). Traditional approximation methods (normal approximation, Poisson approximation) do not necessarily work here (the probability p is small, but the number of successes is large).
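A quick check of this tail with Matlab's binocdf (Statistics and Machine Learning Toolbox; n = 128000 as elsewhere in the slides):

    n = 128000; p = 7803 * 6812 / n^2;    % p ≈ 0.0032 for the computational/algorithms pair
    tail = binocdf(1974, n, p, 'upper')   % P(X > 1974) under independence

The printed value is 0: the true tail is far below the smallest positive double-precision number, which is one more reason to use a bound such as the Chernoff bound on the next slide.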

47 Significance of frequent itemsets: Chernoff bound Let X denote the number of successes in a repeated experiment with n repeats and success probability p. Then the expected value of the number of successes is np. The probability Pr(X > (1 + δ)np) is bounded (Chernoff bound): Pr(X > (1 + δ)np) < (e^δ / (1 + δ)^(1+δ))^(np) < 2^(-δnp).

48 Frequent itemset significance: Testing by simulation NSF data: a = "computational", n_a = 7803; b = "algorithms", n_b = 6812; common observations n_ab = 1974; probability under assumed independence p(a = 1)p(b = 1) = n_a n_b / n^2 ≈ 0.0032. Randomization: let's see what happens when we toss a coin n = 128000 times with probability of success p ≈ 0.0032, and count the number of successes.

49 Frequent itemset significance: Testing by simulation with Matlab

    p = 0.0032;                        % n_a*n_b/n^2
    count = zeros(10, 1);              % initialization
    for a = 1:10                       % slow method
        for i = 1:128000
            if rand(1, 1) < p
                count(a) = count(a) + 1;
            end
        end
    end
    count = zeros(1000, 1);            % initialization
    for a = 1:1000                     % faster method in Matlab
        count(a) = sum(rand(128000, 1) < p);
    end
    hist(count, 40);

50 NSF example of frequent itemset significance: Simulation result Figure: Histogram of the number of successes in 1000 repeats. The probability of 1974 successes appears to be small, but how small?

51 Frequent itemset significance: Comparison with the Chernoff bound Repeated experiment X ~ Bin(n, p) with expected number of successes np: Pr(X > (1 + δ)np) < (e^δ / (1 + δ)^(1+δ))^(np) < 2^(-δnp). Now p ≈ 0.0032, the expected value is np = 415, and the observed value is X = 1974. X > (1 + δ)np, i.e. δ = (X - np)/np = 1559/415 ≈ 3.76. The probability is very small: Pr(X > 1974) = Pr(X > (1 + 3.76) · 415) < (e^3.76 / 4.76^4.76)^415 < 2^(-δnp) = 2^(-1559).
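The bound evaluated in log2 space, a small sketch so the tiny number never has to be represented directly:

    np = 415; X = 1974;
    delta = (X - np) / np;                                             % ≈ 3.76
    log2_bound = np * (delta*log2(exp(1)) - (1 + delta)*log2(1 + delta))
    % ≈ -2193: (e^delta / (1+delta)^(1+delta))^np is about 2^(-2193),
    % comfortably below the cruder bound 2^(-1559) quoted on the slide.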

52 Problem of multiple testing 1/2 Let's assume we find 100 interesting pairs of variables (a, b) and we wish to find out (for all pairs) whether n_ab is significantly larger (or smaller) than the estimate n_a n_b / n given by the independence assumption. Let's further assume that for a single pair (a, b) this difference occurs by chance with probability 0.01. The probability that we get such a result for at least one of the 100 pairs is (why?) 1 - (1 - 0.01)^100 ≈ 0.63. When we examine the randomness of the frequency of occurrence of several patterns, we have to take into account that we are observing several patterns.
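The same arithmetic as a one-line sanity check (M and alpha are assumed names):

    M = 100; alpha = 0.01;
    p_any = 1 - (1 - alpha)^M   % ≈ 0.634: chance that at least one of the 100 tests fires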

53 Problem of multiple testing 2/2 Bonferroni correction: if we are examining the occurrence probabilities of M patterns, we have to multiply the random occurrence probability by M. In the NSF example the probability of a pattern was so small that the correction does not change things. Most of the variable pairs appear together so often that it is extremely unlikely for this to happen by chance.

54 What about larger frequent itemsets? Let's assume the variable set {a, b, c} occurs n_abc times in the data (i.e. there are n_abc columns where the variables a, b, c all have the value 1). Is this more than one should expect in the random case? What is the random case in this situation? E.g. n = 1000 and n_a = n_b = n_c = 500. If the variables are independent, n_abc should be approximately 1000 · 0.5^3 = 125. If n_abc = 250, then the set {a, b, c} is somehow unusual. This could be e.g. because a and b are strongly correlated: if n_ab = 500, then n_abc = 250 is to be expected. What is a suitable null hypothesis? E.g. the so-called maximum entropy hypothesis.

55 How hard is it to look for frequent itemsets? We have 0/1 data and a threshold N; the task is to find all frequent itemsets. The answer can be very large (in the worst case, exponential in the number of variables). In practice the a priori algorithm is efficient enough.

56 Searching for commonly occurring patterns of other kinds The same principles can be applied to other tasks: search for commonly occurring episodes, commonly occurring subgraphs (subnets), and commonly occurring subtrees. The same basic idea of the a priori algorithm works.

57 Summary In this chapter we studied how to transform discrete datasets of various forms into binary (0/1) data. Binary data can be processed and analyzed efficiently with e.g. the a priori algorithm.
