MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance

Jingbo Shang, Jian Peng, Jiawei Han
University of Illinois, Urbana-Champaign
May 6, 2016
Presented by Jingbo Shang

Outline
- Motivation
- Problem Definition
- MACFP: Chunking, Expansion, and Pruning
- Experimental Results
- Application

Why Mine Consecutive Patterns?
- Many kinds of data are interesting at the level of their linear structure, e.g. DNA, RNA, and protein sequences.
- People are interested in consecutive substrings.

Why Approximate & Edit Distance?
- Suppose the database contains the following three DNA sequences and the minimum support threshold (σ) is set to 3. None of them will be treated as a frequent pattern:
    ...accgtgtaggtcg...
    ...accgtttaggtcg...
    ...acggtgtaggtcg...
- However, compared with the total length of these three DNA sequences, the differing positions are few and tolerable.
- The differences are insertions, deletions, and mutations.
- Insertions and deletions are very common in DNA.
- Hamming distance cannot handle them.
- Edit distance is the best fit.

Why Maximal?
- The total number of possible patterns is O(n^2), where n is the length of the string.
- That is too expensive when n grows to a million or a billion.
- The total number of maximal patterns is O(n), which is acceptable.

Related Work
- Exact match: Suffix Tree/Array
- Long pattern: Pattern Fusion
- Hamming distance: REPuter

Definitions: Basic
- S: a string of length |S| = n
- Σ: the alphabet; for DNA, |Σ| = 4
- S_i: the i-th character of S
- S_{i,j}: the substring starting at position i and ending at position j
- d(s, t): the edit distance between strings s and t

Definitions: Equivalent Neighbors
- Two substrings S_{i,j} and S_{x,y} are neighbors if d(S_{i,j}, S_{x,y}) ≤ k.
- k is the edit distance threshold, k ~ o(log n).
- Examples with k = 2:
    ACGACA and ACGTACG are neighbors.
    AACCGA and ACCAAG are not.
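
The neighbor test above can be checked with a standard Levenshtein dynamic program. This is a minimal sketch rather than the MACFP implementation; the function names `edit_distance` and `are_neighbors` are illustrative and do not come from the slides.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance d(s, t) using a two-row dynamic program."""
    prev = list(range(len(t) + 1))          # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,                # delete cs
                curr[j - 1] + 1,            # insert ct
                prev[j - 1] + (cs != ct),   # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


def are_neighbors(s: str, t: str, k: int) -> bool:
    """Two strings are neighbors when their edit distance is at most k."""
    return edit_distance(s, t) <= k


print(are_neighbors("ACGACA", "ACGTACG", 2))  # True  (distance 2)
print(are_neighbors("AACCGA", "ACCAAG", 2))   # False (distance 3)
```

Both slide examples behave as stated: the first pair needs two edits, the second needs three.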

Definitions: Approximate Support
- All neighbors: redundant
- Disjoint neighbors: our choice
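
The slide gives no formula for counting disjoint neighbors, so the following is only one plausible reading. It assumes the approximate support is the largest number of pairwise non-overlapping neighbor occurrences, each given as an inclusive interval (start, end). The function name `disjoint_support` and the greedy earliest-end rule are assumptions for illustration, not taken from the talk.

```python
def disjoint_support(occurrences):
    """Size of a maximum set of pairwise non-overlapping inclusive intervals.

    Assumption: each neighbor occurrence S_{x,y} is passed as the pair (x, y).
    Picking intervals in order of increasing end position is the classic
    interval-scheduling greedy, which yields a maximum disjoint set.
    """
    count, last_end = 0, None
    for start, end in sorted(occurrences, key=lambda iv: iv[1]):
        if last_end is None or start > last_end:
            count += 1
            last_end = end
    return count


# Three heavily overlapping occurrences count once; the distant one counts again.
print(disjoint_support([(0, 5), (1, 6), (2, 7), (10, 15)]))  # 2
```

Counting every neighbor would report 4 here, which illustrates the redundancy the slide refers to.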

Our Goal: All Frequent & Maximal
- Long enough: length at least L
- Approximately frequent: approximate support ≥ σ
- Maximal
- Goal: find ALL these maximal approximate frequent patterns

MACFP: Support Checking Framework
- Suppose there is an oracle that can tell us the approximate support of any substring S_{i,j}.
- Then only O(n) queries are needed.

MACFP: Fast Chunk Indexing
- Edit distance ≤ k
- Segment S_{i,j} into k + 1 chunks.
- At least one of these chunks must be matched exactly.
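
The chunking argument is a pigeonhole principle: k edits can touch at most k of the k + 1 chunks, so any neighbor must contain at least one chunk unchanged. The sketch below illustrates candidate generation under that idea. The helper names (`split_into_chunks`, `build_index`, `chunk_anchors`) and the dictionary-based index are assumptions for illustration; the slides do not describe the actual index structure.

```python
from collections import defaultdict


def split_into_chunks(pattern, k):
    """Split pattern into k + 1 contiguous chunks, returned as (offset, text)."""
    parts = k + 1
    base, extra = divmod(len(pattern), parts)
    chunks, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)   # spread the remainder evenly
        chunks.append((start, pattern[start:start + size]))
        start += size
    return chunks


def build_index(text, length):
    """Map every substring of the given length to its start positions in text."""
    index = defaultdict(list)
    for i in range(len(text) - length + 1):
        index[text[i:i + length]].append(i)
    return index


def chunk_anchors(text, pattern, k):
    """Return (chunk_offset, text_position) pairs where a chunk matches exactly.

    Only regions around these anchors can hold a neighbor of pattern, so the
    expensive edit-distance check is limited to them.
    """
    anchors, indexes = [], {}
    for offset, chunk in split_into_chunks(pattern, k):
        if not chunk:
            continue
        if len(chunk) not in indexes:
            indexes[len(chunk)] = build_index(text, len(chunk))
        for pos in indexes[len(chunk)].get(chunk, []):
            anchors.append((offset, pos))
    return anchors


text = "ttaccgtttaggtcgtt"      # contains accgtttaggtcg, one substitution away
print(chunk_anchors(text, "accgtgtaggtcg", 1))  # [(7, 9)]
```

With k = 1 the pattern splits into `accgtgt` and `aggtcg`. The substitution falls in the first chunk, so the second chunk still matches exactly and anchors the candidate.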

MACFP: Efficient Expanding
- Dynamic programming algorithm for the edit distance between S and T
- If S_1 = T_1, then d(S_{1,i}, T_{1,j}) = d(S_{2,i}, T_{2,j}).
- We can adopt this idea to match two strings greedily.
- The running time is exponential in k.
- Fortunately, k is usually small!
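
One way to read the greedy idea is as a bounded search: skip equal leading characters for free, and branch into the three edit operations only at a mismatch. The recursion depth is at most k, so at most 3^k branches are explored, which matches the "exponential in k" remark. This is a sketch of that general technique, not the expansion routine from the talk; the name `within_k` is illustrative.

```python
def within_k(s: str, t: str, k: int) -> bool:
    """Return True if the edit distance between s and t is at most k."""
    i = 0
    # Greedy step: if the first characters agree, dropping both keeps the
    # distance unchanged, so consume the whole common prefix at no cost.
    while i < len(s) and i < len(t) and s[i] == t[i]:
        i += 1
    s, t = s[i:], t[i:]
    if not s or not t:
        # Only insertions or deletions remain.
        return len(s) + len(t) <= k
    if k == 0:
        return False
    # The first characters differ, so one edit is spent here.
    return (within_k(s[1:], t[1:], k - 1)   # substitute
            or within_k(s[1:], t, k - 1)    # delete from s
            or within_k(s, t[1:], k - 1))   # insert into s


print(within_k("ACGACA", "ACGTACG", 2))  # True
print(within_k("AACCGA", "ACCAAG", 2))   # False
```

For small k this avoids filling a full dynamic programming table, since long runs of matching characters cost nothing.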

MACFP: Lower Bound Pruning

Experiments: Compared Methods
- TDP: a dynamic programming-based method
- TDP+: applies the Fast Chunk Indexing technique to accelerate TDP
- MACFP variant: MACFP with the Lower Bound Pruning technique turned off
- MACFP: our proposed algorithm

Experiments: Edit Distance
- The running time is exponential in k.
- The running time grows more slowly than the total number of patterns!

Experiments: Length Threshold
- Faster for larger L, because of Fast Chunk Indexing.

Experiments: Length of DNA Sequence
- Scalable!

Application: Generation
- A length-n normal DNA sequence S
- A length-m fatal subsequence s
- s is duplicated T times.
- Each copy is allowed at most 1 edit (10% probability per edit type) to model potential variation.
- The new (patient) DNA sequence is denoted by P.
[Figure: hot regions found from randomly accessed gene subsequences of the patient, and hot regions found after MACFP using only maximal frequent patterns. Labels: Normal Gene, Fatal Gene Subsequence.]
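
The generation procedure can be sketched as follows. Several details are assumptions, since the slide does not specify them: the sequences are drawn uniformly at random, "10% probability per edit type" is read as 10% each for substitution, insertion, and deletion (so 70% of copies are unchanged), and the T copies are placed as one consecutive block at a random position. All function names are illustrative.

```python
import random

ALPHABET = "ACGT"


def random_dna(length, rng):
    """Uniformly random DNA string (assumption: the slide gives no distribution)."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def mutate_copy(s, rng, p=0.10):
    """Apply at most one edit; each edit type is chosen with probability p."""
    roll = rng.random()
    pos = rng.randrange(len(s))
    if roll < p:                                  # substitution
        other = rng.choice([c for c in ALPHABET if c != s[pos]])
        return s[:pos] + other + s[pos + 1:]
    if roll < 2 * p:                              # insertion
        return s[:pos] + rng.choice(ALPHABET) + s[pos:]
    if roll < 3 * p:                              # deletion
        return s[:pos] + s[pos + 1:]
    return s                                      # unchanged otherwise


def make_patient(n, m, copies, seed=0):
    """Return (normal S, fatal s, patient P) for the synthetic scenario."""
    rng = random.Random(seed)
    normal = random_dna(n, rng)
    fatal = random_dna(m, rng)
    # Assumption: the copies form one consecutive block inside S.
    block = "".join(mutate_copy(fatal, rng) for _ in range(copies))
    at = rng.randrange(n + 1)
    return normal, fatal, normal[:at] + block + normal[at:]


normal, fatal, patient = make_patient(n=10_000, m=50, copies=100, seed=1)
print(len(normal), len(fatal), len(patient))
```

Fixing the seed makes the generated patient sequence reproducible, which is convenient when comparing runs.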

Application: Real World Scenarios
- RMC: read mapping and counting
- Short tandem repeat: n = 10,000, m = 50, T = 100
- Copy number variation: n = 10,000, m = 1,000, T = 20

Conclusion & Future Work
- MACFP can efficiently identify ALL approximate frequent patterns under edit distance.
- Future work: specialize and apply MACFP to specific bioinformatics problems.