MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance

Jingbo Shang, Jian Peng, Jiawei Han
University of Illinois, Urbana-Champaign
May 6, 2016
Presented by Jingbo Shang

Outline
Motivation
Problem Definition
MACFP: Chunking, Expansion, and Pruning
Experimental Results
Application

Why Mine Consecutive Patterns?
Much data is interesting at the level of its linear structure
  DNA, RNA, and protein sequences
People are interested in consecutive substrings

Why Approximate & Edit Distance?
Suppose the database contains the following three DNA sequences and the minimum support threshold (σ) is set to 3
  ...accgtgtaggtcg...
  ...accgtttaggtcg...
  ...acggtgtaggtcg...
Under exact matching, none of them is treated as a frequent pattern
However, relative to the total length of these three DNA sequences, the number of differing positions is small and tolerable
Such differences are insertions, deletions, and mutations
Insertions and deletions are very common in DNA
Hamming distance cannot handle them
Edit distance is the best fit
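To make the Hamming versus edit distance point concrete, here is a minimal Python sketch. The `shifted` string is a hypothetical variant (not from the slides) built from the first sequence by one deletion and one insertion; the function names are illustrative only.

```python
def hamming(s: str, t: str) -> int:
    """Number of mismatching positions; defined only for equal lengths."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(s, t))


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # match or substitute
        prev = curr
    return prev[-1]


original = "accgtgtaggtcg"      # first sequence on the slide
shifted = original[1:] + "a"    # hypothetical: drop the first base, append one
print(hamming(original, shifted), edit_distance(original, shifted))
```

A single shift misaligns almost every position for Hamming distance, while edit distance charges only for the one deletion and the one insertion.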

Why Maximal?
The total number of possible patterns is O(n^2), where n is the length of the string
Too expensive when n grows to a million or a billion
The total number of maximal patterns is O(n), which is acceptable

Related Work
Exact Match: Suffix Tree/Array
Long Pattern: Pattern Fusion
Hamming Distance: REPuter

Definitions: Basic
S: a string of length |S| = n
Σ: the alphabet; for DNA, |Σ| = 4
S_i: the i-th character of S
S_{i,j}: the substring starting at i and ending at j
d(s, t): the edit distance between strings s and t

Definitions: Equivalent Neighbors
Two substrings S_{i,j} and S_{x,y} are equivalent neighbors if d(S_{i,j}, S_{x,y}) ≤ k
k is the edit distance threshold, k ~ o(log n)
Examples (k = 2)
  ACGACA and ACGTACG are neighbors
  AACCGA and ACCAAG are not
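The neighbor test can be sketched in Python as a thresholded edit distance check. The banded dynamic program below is a common optimization chosen here for illustration, not something the slides describe; `is_neighbor` is a hypothetical name.

```python
def is_neighbor(s: str, t: str, k: int) -> bool:
    """True if edit_distance(s, t) <= k, filling only cells with |i - j| <= k."""
    n, m = len(s), len(t)
    if abs(n - m) > k:
        return False                      # length gap alone already exceeds k
    too_far = k + 1                       # sentinel for "more than k edits"
    prev = [j if j <= k else too_far for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [too_far] * (m + 1)
        lo, hi = max(0, i - k), min(m, i + k)
        if lo == 0:
            curr[0] = i                   # i <= k here: delete the first i chars
        for j in range(max(1, lo), hi + 1):
            cost = min(prev[j] + 1,                           # deletion
                       curr[j - 1] + 1,                       # insertion
                       prev[j - 1] + (s[i - 1] != t[j - 1]))  # match or substitution
            curr[j] = min(cost, too_far)
        prev = curr
    return prev[m] <= k


print(is_neighbor("ACGACA", "ACGTACG", 2))  # example pair from the slide
print(is_neighbor("AACCGA", "ACCAAG", 2))   # example pair from the slide
```

Cells outside the band correspond to alignments that already need more than k insertions or deletions, so skipping them does not change the answer.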

Definitions: Approximate Support
All neighbors: redundant
Disjoint neighbors: our choice
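One way to read "disjoint neighbors" is as the largest set of neighbor occurrences that do not overlap each other. The slide does not spell out the exact definition, so the Python sketch below is an assumption: it treats each occurrence as an inclusive (start, end) interval and uses the classic earliest-end-first greedy. `disjoint_support` is a hypothetical name.

```python
def disjoint_support(occurrences):
    """Maximum number of pairwise non-overlapping (start, end) intervals.

    Ends are inclusive. Assumption: approximate support counts such a
    maximum disjoint set of neighbor occurrences.
    """
    count = 0
    last_end = float("-inf")
    # Earliest-finishing-first greedy is optimal for interval scheduling.
    for start, end in sorted(occurrences, key=lambda iv: iv[1]):
        if start > last_end:
            count += 1
            last_end = end
    return count


# Hypothetical neighbor occurrences: three overlapping ones, then two more.
neighbors = [(0, 5), (1, 6), (2, 7), (10, 15), (11, 16)]
print(len(neighbors), disjoint_support(neighbors))
```

Counting all five intervals would count nearly the same region several times, which is the redundancy the slide points at.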

Our Goal: All Frequent & Maximal
Long enough: length at least L
Approximately frequent: approximate support ≥ σ
Maximal
Goal: find ALL such maximal approximate frequent patterns

MACFP: Support Checking Framework
Suppose there is an oracle that can tell us the approximate support of the substring S_{i,j}
We need only O(n) queries
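The slide does not say how the O(n) bound is reached. A plausible reading is a two-pointer sweep, sketched below in Python under two loud assumptions: support never increases when a substring is extended, and an exact-match counter stands in for the approximate-support oracle. `count_exact` and `maximal_frequent` are hypothetical names.

```python
def count_exact(text, pattern):
    """Stand-in oracle: number of (possibly overlapping) exact occurrences."""
    count, pos = 0, text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def maximal_frequent(text, sigma, min_len, support=count_exact):
    """Two-pointer sweep returning (start, end) spans, end exclusive.

    Assumes support(text, p) never increases when p is extended.
    """
    n, results, queries = len(text), [], 0
    end = 0                      # exclusive right end of the current window
    for start in range(n):
        prev_end = end
        end = max(end, start)    # the window never moves backwards
        while end < n:
            queries += 1
            if support(text, text[start:end + 1]) < sigma:
                break
            end += 1
        # A window that reaches no further than the previous one is
        # contained in it, so it is not maximal.
        if end > prev_end and end - start >= min_len:
            results.append((start, end))
    return results, queries


spans, used = maximal_frequent("abcabdabc", sigma=2, min_len=2)
print(spans, used)
```

Every query either advances the right pointer or ends the inner loop for one start position, so the sweep issues at most 2n queries.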

MACFP: Fast Chunk Indexing
Edit distance ≤ k
Segment S_{i,j} into k + 1 chunks
At least one of these chunks must be matched exactly
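The chunking idea is a pigeonhole argument: k edits can touch at most k of the k + 1 chunks, so one chunk survives intact. The Python sketch below shows only the candidate-generation step, with near-equal chunk lengths as an assumed splitting rule; `split_chunks` and `candidate_seeds` are hypothetical names, and a real index would replace the repeated `str.find` scans.

```python
def split_chunks(pattern, k):
    """Split pattern into k + 1 consecutive chunks; returns (offset, chunk) pairs."""
    parts = k + 1
    if len(pattern) < parts:
        raise ValueError("pattern must have at least k + 1 characters")
    base, extra = divmod(len(pattern), parts)
    chunks, pos = [], 0
    for idx in range(parts):
        size = base + (1 if idx < extra else 0)   # spread the remainder evenly
        chunks.append((pos, pattern[pos:pos + size]))
        pos += size
    return chunks


def candidate_seeds(text, pattern, k):
    """Exact chunk hits as (text_position, chunk_offset) pairs.

    Any substring of text within edit distance k of pattern must contain
    at least one of these hits, so only they need full verification.
    """
    seeds = []
    for offset, chunk in split_chunks(pattern, k):
        pos = text.find(chunk)
        while pos != -1:
            seeds.append((pos, offset))
            pos = text.find(chunk, pos + 1)
    return sorted(seeds)


# Hypothetical text containing "ACGTACCT", one substitution away from the pattern.
print(candidate_seeds("TTACGTACCTTT", "ACGTACGT", k=1))
```

Each seed still has to be verified with a real edit distance computation; the chunks only narrow down where to look.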

MACFP: Efficient Expanding
Dynamic programming algorithm for the edit distance between S and T
If S_1 = T_1, then d(S_{1,i}, T_{1,j}) = d(S_{2,i}, T_{2,j})
We can adopt this idea to greedily match two strings
Exponential in k
Fortunately, k is usually small!
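A minimal Python sketch of greedy matching with bounded branching follows. It is one standard way to use the identity above and is not taken from the slides; `within_k` is a hypothetical name. Equal leading characters are consumed for free, and each remaining mismatch branches into three edits, which is where the exponential dependence on k comes from.

```python
def within_k(s: str, t: str, k: int) -> bool:
    """True if edit_distance(s, t) <= k, using at most 3**k branches."""
    # Greedy step: equal leading characters never need an edit.
    i = 0
    while i < len(s) and i < len(t) and s[i] == t[i]:
        i += 1
    s, t = s[i:], t[i:]
    if not s or not t:
        return len(s) + len(t) <= k   # only insertions or deletions remain
    if k == 0:
        return False                  # a mismatch remains but no edits are left
    return (within_k(s[1:], t[1:], k - 1)   # substitute the first character
            or within_k(s[1:], t, k - 1)    # delete it from s
            or within_k(s, t[1:], k - 1))   # insert t's first character into s


print(within_k("ACGACA", "ACGTACG", 2))
print(within_k("AACCGA", "ACCAAG", 2))
```

The recursion depth is at most k, so the cost depends on k and on the lengths of the matched runs rather than on the product of the two string lengths.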

MACFP: Lower Bound Pruning

Experiments: Compared Methods
TDP: dynamic programming-based method
TDP+: applies the Fast Chunk Indexing technique to accelerate TDP
MACFP without Lower Bound Pruning: MACFP with the Lower Bound Pruning technique turned off
MACFP: our proposed algorithm

Experiments: Edit Distance
Running time is exponential in k
The running time grows more slowly than the total number of patterns!

Experiments: Length Threshold
Faster for larger L
Because we have Fast Chunk Indexing

Experiments: Length of DNA Sequence
Scalable!

Application: Generation
Length-n normal DNA sequence S
Length-m fatal subsequence s
s is duplicated T times
We allow at most 1 edit (10% probability per edit type) for potential variation in each copy
The new (patient) DNA sequence is denoted by P
[Figure: a normal gene containing fatal gene subsequences. Panels: hot region found by random access of gene subsequences from the patient; hot region found after MACFP, using only maximal frequent patterns.]
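The generation procedure can be sketched in Python as follows. Several details are assumptions rather than facts from the slides: copies are spliced in at random cut points of the normal sequence, "10% probability per edit type" is read as 10% each for insertion, deletion, and substitution (so a copy stays unchanged 70% of the time), and the names `mutate_once` and `make_patient` are hypothetical.

```python
import random

ALPHABET = "acgt"


def mutate_once(s, rng, p=0.10):
    """Apply at most one edit; each edit type is picked with probability p."""
    r = rng.random()
    pos = rng.randrange(len(s))
    if r < p:                                   # insertion
        return s[:pos] + rng.choice(ALPHABET) + s[pos:]
    if r < 2 * p:                               # deletion
        return s[:pos] + s[pos + 1:]
    if r < 3 * p:                               # substitution by a different base
        other = rng.choice([c for c in ALPHABET if c != s[pos]])
        return s[:pos] + other + s[pos + 1:]
    return s                                    # unchanged otherwise


def make_patient(n, m, copies, seed=0):
    """Return (normal S, fatal s, patient P) for the synthetic setup."""
    rng = random.Random(seed)
    normal = "".join(rng.choice(ALPHABET) for _ in range(n))
    fatal = "".join(rng.choice(ALPHABET) for _ in range(m))
    # Assumption: each copy of s is spliced in at a random cut point of S.
    cuts = sorted(rng.randrange(n + 1) for _ in range(copies))
    pieces, last = [], 0
    for cut in cuts:
        pieces.append(normal[last:cut])
        pieces.append(mutate_once(fatal, rng))
        last = cut
    pieces.append(normal[last:])
    return normal, fatal, "".join(pieces)


normal, fatal, patient = make_patient(n=1000, m=50, copies=10, seed=1)
print(len(normal), len(fatal), len(patient))
```

Fixing the seed keeps the generated patient sequence reproducible across runs.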

Application: Real World Scenarios
RMC: read mapping and counting
Short tandem repeat: n = 10,000, m = 50, T = 100
Copy number variation: n = 10,000, m = 1,000, T = 20

Conclusion & Future Work
MACFP can efficiently identify ALL approximate frequent patterns under edit distance
Future work: specialize and apply MACFP to specific bioinformatics problems