Mining Frequent Closed Unordered Trees Through Natural Representations


José L. Balcázar, Albert Bifet and Antoni Lozano
Universitat Politècnica de Catalunya
PASCAL Workshop: Learning from and with Graphs, Alicante, 2007

"Trees are sanctuaries. Whoever knows how to listen to them, can learn the truth." (Hermann Hesse)

Mining frequent trees is becoming an important task. Applications:
- chemical informatics
- computer vision
- text retrieval
- bioinformatics
- Web analysis

Many link-based structures may be studied formally by means of unordered trees.

Introduction: Unordered Trees

One unordered tree with two different drawings, each of which corresponds to a different ordered tree.

Introduction

- Induced subtrees: obtained by repeatedly removing leaf nodes
- Embedded subtrees: obtained by contracting some of the edges

Introduction: What Is Tree Pattern Mining?

Given a dataset of trees, find the complete set of frequent subtrees.

- Frequent Tree Patterns (FT): all the trees whose support is no less than min_sup
- Closed Frequent Tree Patterns (CT): the frequent trees that have no super-tree with the same support; CT ⊆ FT

Closed frequent tree mining provides a compact representation of the frequent trees without loss of information.

Introduction: Ordered Subtree Mining

D = {A, B}, min_sup = 2. Number of closed subtrees: 2 (X and Y); number of frequent subtrees: 8. [Figure: the dataset trees A and B, the closed subtrees X and Y, and the eight frequent subtrees.]

Introduction: Unordered Subtree Mining

D = {A, B}, min_sup = 2. Number of closed subtrees: 2 (X and Y); number of frequent subtrees: 9. [Figure: the dataset trees A and B, the closed subtrees X and Y, and the nine frequent subtrees.]

Related Work

Survey: Yun Chi, Richard Muntz, Siegfried Nijssen, Joost Kok. Frequent Subtree Mining: An Overview, 2005.

Frequent labelled, rooted, unordered, induced trees:
- Unot [Asai 2003]
- UFreqT [Nijssen 2003]
- HybridTreeMiner [Chi 2004]
- PathJoin [Xiao 2003]

Closed trees:
- Labelled induced trees: CMTreeMiner [Chi, Yang, Xia, Muntz 2004]
- Labelled relaxed-included trees: DRYADE [Termier, Rousset, Sebag 2004]
- Labelled attribute trees: CLOATT [Arimura, Uno 2005]

Natural Representation

Definition. Given two sequences of natural numbers x, y:
- x · y: the concatenation of x and y
- x + i: the addition of i to each component of x
- x⁺ = x + 1

Definition. A natural sequence is a sequence (x_1, ..., x_n) of natural numbers such that x_1 = 0 and each subsequent number x_{i+1} belongs to the range 1 ≤ x_{i+1} ≤ x_i + 1.

Example: x = (0, 1, 2, 3, 1, 2) = (0) · (0, 1, 2)⁺ · (0, 1)⁺
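The two definitions above can be sketched directly in code (a minimal Python sketch; the function names are illustrative, not from the paper):

```python
def is_natural_sequence(xs):
    """True iff xs starts at 0 and each later entry lies in 1..(previous + 1)."""
    if not xs or xs[0] != 0:
        return False
    return all(1 <= xs[i + 1] <= xs[i] + 1 for i in range(len(xs) - 1))

def plus(xs):
    """The x+ operation: add 1 to each component."""
    return [x + 1 for x in xs]

# The example above: (0, 1, 2, 3, 1, 2) = (0) . (0, 1, 2)+ . (0, 1)+
x = [0] + plus([0, 1, 2]) + plus([0, 1])
assert x == [0, 1, 2, 3, 1, 2]
assert is_natural_sequence(x)
```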

Natural Representation

Definition. Let t be an ordered tree. If t is a single node, then ⟨t⟩ = (0). Otherwise, if t is composed of the trees t_1, ..., t_k joined to a common root r (where the ordering t_1, ..., t_k is the same as that of the children of r), then

⟨t⟩ = (0) · ⟨t_1⟩⁺ · ⟨t_2⟩⁺ ⋯ ⟨t_k⟩⁺

⟨t⟩ is the natural representation of t.

Example: ⟨t⟩ = (0) · (0, 1, 1, 2)⁺ · (0)⁺ = (0, 1, 2, 2, 3, 1)
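The recursive definition translates directly into code. A sketch, assuming Python and a nested-list encoding in which a node is the list of its children (a leaf is the empty list):

```python
def natural_representation(t):
    """Natural representation of an ordered tree t, where a node is
    encoded as the list of its children (a leaf is the empty list)."""
    rep = [0]                                    # the root contributes (0)
    for child in t:                              # append each <t_i>+ in order
        rep += [d + 1 for d in natural_representation(child)]
    return rep

# The example above: (0) . (0, 1, 1, 2)+ . (0)+ = (0, 1, 2, 2, 3, 1)
t = [[[], [[]]], []]
assert natural_representation(t) == [0, 1, 2, 2, 3, 1]
```

Each entry of the sequence is the depth of a node in the preorder traversal, which is why the shift by one per recursion level recovers the tree shape.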

Mining Frequent Subtrees in the Ordered Case

Definition. y is a one-step extension of x (in symbols, x →₁ y) if x is a prefix of y and |y| = |x| + 1.

A series of one-step extensions from (0) to a natural sequence x,

(0) →₁ x_1 →₁ ⋯ →₁ x_{k-1} →₁ x

always exists and must be unique, since the x_i's can only be the prefixes of x.
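A short Python sketch of the one-step extensions of a natural sequence (the function name is illustrative): the appended depth may be anything from 1 up to one more than the last depth.

```python
def one_step_extensions(x):
    """All natural sequences y with x a prefix of y and |y| = |x| + 1."""
    return [x + [d] for d in range(1, x[-1] + 2)]

assert one_step_extensions([0]) == [[0, 1]]
assert one_step_extensions([0, 1]) == [[0, 1, 1], [0, 1, 2]]
```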

Mining Frequent Subtrees in the Ordered Case

FREQUENT_SUBTREE_MINING(t, D, min_sup, T)
Input: A tree t, a tree dataset D, and min_sup.
Output: The frequent tree set T.

  insert t into T
  for every t' that can be extended from t in one step
    do if support(t') ≥ min_sup
         then T ← FREQUENT_SUBTREE_MINING(t', D, min_sup, T)
  return T

Mining Frequent Subtrees in the Ordered Case

FREQUENT_SUBTREE_MINING(t, D, min_sup, T)
Input: A tree t, a tree dataset D, and min_sup.
Output: The frequent tree set T.

  insert t into T
  C ← ∅
  for every t' that can be extended from t in one step
    do if support(t') ≥ min_sup
         then insert t' into C
  for each t' in C
    do T ← FREQUENT_SUBTREE_MINING(t', D, min_sup, T)
  return T
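The pseudocode above can be sketched over natural representations. In this Python sketch the support function is a caller-supplied argument; the actual counting of occurrences in the dataset D, which the slides leave implicit, is assumed away, and the toy support below is purely illustrative:

```python
def frequent_subtree_mining(x, support, min_sup, T=None):
    """Enumeration skeleton over natural representations.
    `support` is a caller-supplied anti-monotone support function."""
    if T is None:
        T = []
    T.append(tuple(x))
    C = [x + [d] for d in range(1, x[-1] + 2)    # one-step extensions
         if support(x + [d]) >= min_sup]
    for y in C:
        frequent_subtree_mining(y, support, min_sup, T)
    return T

# Toy support: pretend every sequence of length <= 3 occurs twice in D.
toy = lambda y: 2 if len(y) <= 3 else 0
assert frequent_subtree_mining([0], toy, 2) == [(0,), (0, 1), (0, 1, 1), (0, 1, 2)]
```

Because every natural sequence has a unique chain of one-step extensions from (0), each ordered tree is generated exactly once.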

Canonical Forms

Definition. Let t be an unordered tree, and let t_1, ..., t_n be all the ordered trees obtained from t by ordering in all possible ways all the sets of siblings of t. The canonical representative of t is the ordered tree t_0 whose natural representation is maximal (according to the lexicographic ordering) among the natural representations of the trees t_i, that is, such that ⟨t_0⟩ = max{⟨t_i⟩ | 1 ≤ i ≤ n}.
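A Python sketch of computing the canonical representative (names are illustrative): rather than enumerating all sibling orderings, it sorts each set of siblings by decreasing natural representation, which yields the lexicographically maximal concatenation because a subtree's shifted representation always starts deeper than the start of the next sibling's.

```python
def canonical_representation(t):
    """Natural representation of the canonical representative of an
    unordered tree t (nested-list encoding: a node is its list of children)."""
    child_reps = sorted((canonical_representation(c) for c in t), reverse=True)
    rep = [0]
    for cr in child_reps:                 # siblings in decreasing lex order
        rep += [d + 1 for d in cr]
    return rep

# Two drawings of the same unordered tree get the same representative.
assert canonical_representation([[[]], []]) == canonical_representation([[], [[]]])
assert canonical_representation([[], [[]]]) == [0, 1, 2, 1]
```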

Mining Frequent Subtrees in the Unordered Case

FREQUENT_SUBTREE_MINING(t, D, min_sup, T)
Input: A tree t, a tree dataset D, and min_sup.
Output: The frequent tree set T.

  insert t into T
  C ← ∅
  for every t' that can be extended from t in one step
    do if support(t') ≥ min_sup
         then insert t' into C
  for each t' in C
    do T ← FREQUENT_SUBTREE_MINING(t', D, min_sup, T)
  return T

Mining Frequent Subtrees in the Unordered Case

FREQUENT_SUBTREE_MINING(t, D, min_sup, T)
Input: A tree t, a tree dataset D, and min_sup.
Output: The frequent tree set T.

  if not CANONICAL_REPRESENTATIVE(t)
    then return T
  insert t into T
  C ← ∅
  for every t' that can be extended from t in one step
    do if support(t') ≥ min_sup
         then insert t' into C
  for each t' in C
    do T ← FREQUENT_SUBTREE_MINING(t', D, min_sup, T)
  return T

Closure-Based Mining

CLOSED_SUBTREE_MINING(t, D, min_sup, T)

  if not CANONICAL_REPRESENTATIVE(t)
    then return T
  C ← ∅
  for every t' that can be extended from t in one step
    do if support(t') ≥ min_sup
         then insert t' into C
  for each t' in C
    do T ← CLOSED_SUBTREE_MINING(t', D, min_sup, T)
  return T

Closure-Based Mining

CLOSED_SUBTREE_MINING(t, D, min_sup, T)

  if not CANONICAL_REPRESENTATIVE(t)
    then return T
  C ← ∅
  for every t' that can be extended from t in one step
    do if support(t') ≥ min_sup
         then insert t' into C
              if support(t') = support(t)
                then t is not closed
  if t is closed
    then insert t into T
  for each t' in C
    do T ← CLOSED_SUBTREE_MINING(t', D, min_sup, T)
  return T
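The closure check above can be sketched over natural representations. As in the earlier skeleton, the support function is a caller-supplied assumption (counting in D is omitted), and the toy support is illustrative only; mirroring the pseudocode, a tree is pruned from the output when some frequent one-step extension has the same support:

```python
def closed_subtree_mining(x, support, min_sup, T=None):
    """Sketch of closure-based mining: x is not closed if some frequent
    one-step extension has the same support as x."""
    if T is None:
        T = []
    C, closed = [], True
    for d in range(1, x[-1] + 2):         # one-step extensions
        y = x + [d]
        if support(y) >= min_sup:
            C.append(y)
            if support(y) == support(x):
                closed = False            # a super-tree with the same support
    if closed:
        T.append(tuple(x))
    for y in C:
        closed_subtree_mining(y, support, min_sup, T)
    return T

# Toy support: (0) and (0, 1) occur twice, length-3 sequences once.
toy = lambda y: 2 if len(y) <= 2 else (1 if len(y) == 3 else 0)
# (0) is not closed: its extension (0, 1) has the same support.
assert closed_subtree_mining([0], toy, 1) == [(0, 1), (0, 1, 1), (0, 1, 2)]
```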

Example: Ordered Case

min_sup = 2. A: (0, 1, 2, 3, 2, 1), B: (0, 1, 2, 3, 1, 2, 2)

[Search-tree diagram: (0) →₁ (0, 1) →₁ (0, 1, 2), which branches to (0, 1, 2, 2) and to (0, 1, 2, 3) →₁ (0, 1, 2, 3, 1).]

Example: Unordered Case

min_sup = 2. A: (0, 1, 2, 3, 2, 1), B: (0, 1, 2, 3, 1, 2, 2)

[Figure: the trees A, B and the closed trees X, Y. Search-tree diagram: (0) →₁ (0, 1) →₁ (0, 1, 2), which branches to (0, 1, 2, 2) →₁ (0, 1, 2, 2, 1) and to (0, 1, 2, 3) →₁ (0, 1, 2, 3, 1).]

Experiments: Gazelle, Unordered Trees

[Figure: running time of CMTreeMiner versus our method as support ranges from 0 to 40,000.]

Conclusions and Future Work

Through our proposed representation of ordered trees, we have presented efficient algorithms for mining ordered and unordered frequent closed trees. The sequential form of our representation, where the number-encoded depth furnishes the two-dimensional information, is key to the fast processing of the data.

Future work:
- Consider labelled subtrees
- Consider embedded subtrees

Future Work

Tree Kernels

Definition (Subset Trees). A subset tree is a set of connected nodes of a tree T.

Definition (Collins and Duffy 2001). Denote by T, T′ trees and by t ⊑ T a subset tree of T; then

k(T, T′) = Σ_{t ⊑ T, t′ ⊑ T′} w_t δ_{t,t′}

Definition (Vishwanathan and Smola 2002). In case we count matching subtrees, t ⊑ T denotes that t is a subtree of T, and

k(T, T′) = Σ_{t ⊑ T, t′ ⊑ T′} w_t δ_{t,t′}

Tree Kernels

S. V. N. Vishwanathan and Alexander J. Smola. Fast Kernels for String and Tree Matching, 2002.

We can compute a tree kernel by:
- converting the trees to strings
- computing string kernels

Advantages:
- Simple storage and simple implementation (a dynamic array suffices)
- All speedups for strings work for tree kernels too (XML documents, etc.)
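A minimal Python sketch of the strings route, under stated assumptions: a nested-list tree encoding, a bracket-string encoding of subtrees (sorting children so sibling order does not matter), and uniform weights w_t = 1. This counts pairs of identical rooted subtrees with a hash table rather than implementing the full suffix-based string kernel of the paper:

```python
from collections import Counter

def encode(t, out):
    """Bracket-encode t; children are sorted so the encoding is
    order-independent. Collects the encoding of every rooted subtree in out."""
    s = "[" + "".join(sorted(encode(c, out) for c in t)) + "]"
    out.append(s)
    return s

def tree_kernel(t1, t2):
    """Count pairs of identical rooted subtrees (uniform weights w_t = 1)."""
    a, b = [], []
    encode(t1, a)
    encode(t2, b)
    ca, cb = Counter(a), Counter(b)
    return sum(ca[s] * cb[s] for s in ca)

# Root with two leaves vs. root with one leaf: only the leaf pairs match.
assert tree_kernel([[], []], [[]]) == 2
```

Hashing the string encodings is what makes the storage and implementation simple: matching subtrees reduces to matching strings.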