Decompositions of Matrices with Relational Data: Foundations and Algorithms


Decompositions of Matrices with Relational Data: Foundations and Algorithms Martin Trnečka DEPARTMENT OF COMPUTER SCIENCE PALACKÝ UNIVERSITY OLOMOUC PhD dissertation defense Feb 7, 2017

Outline
  My research
  Boolean matrix factorization (BMF)
  Geometry of BMF
  Basic viewpoints
  Formal concept analysis and BMF
  Role of entries containing 1
  Essential elements
  GreEss algorithm
  Experimental results

My research
  Boolean matrix factorization (BMF): algorithms (19 previous, 6 new), experimental evaluation, applications
  Basic level of concepts: identify important knowledge in data
  Citations (since 2013): Web of Science: 4 (h-index: 2); Scopus: 24 (h-index: 3); Google Scholar: 66 (h-index: 5)

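Boolean matrix factorization (BMF)
  Method for analysis of Boolean data.
  A general aim: for a given matrix $I \in \{0,1\}^{n \times m}$, find matrices $A \in \{0,1\}^{n \times k}$ and $B \in \{0,1\}^{k \times m}$ for which $I$ (approximately) equals $A \circ B$.
  $\circ$ is the Boolean matrix product: $(A \circ B)_{ij} = \max_{l=1}^{k} \min(A_{il}, B_{lj})$.
  $$\begin{pmatrix} 1&0&1&1&1 \\ 0&1&1&0&1 \\ 0&1&0&0&1 \\ 1&0&1&1&0 \end{pmatrix} = \begin{pmatrix} 1&1&0 \\ 0&1&1 \\ 0&0&1 \\ 1&0&0 \end{pmatrix} \circ \begin{pmatrix} 1&0&1&1&0 \\ 0&0&1&0&1 \\ 0&1&0&0&1 \end{pmatrix}$$
  Discovery of $k$ factors that exactly or approximately explain the data.
  Factors = interesting patterns (rectangles) in data.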
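
The definition of the Boolean matrix product can be checked directly on the 4 x 5 example from this slide. The following is a minimal sketch in plain Python; the helper name `bool_product` is chosen here for illustration and does not come from the slides.

```python
def bool_product(A, B):
    """Boolean matrix product: (A o B)[i][j] = max over l of min(A[i][l], B[l][j])."""
    k = len(B)       # number of factors
    m = len(B[0])    # number of columns of the result
    return [[max(min(row[l], B[l][j]) for l in range(k)) for j in range(m)]
            for row in A]

# Example matrices from the slide.
I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]
A = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
B = [[1, 0, 1, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 1, 0, 0, 1]]

print(bool_product(A, B) == I)  # prints True: an exact factorization with k = 3
```

On 0/1 values, min acts as logical AND and max as logical OR, so the same function could equally be written with Boolean operators.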

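Geometry of BMF
  Geometry of factorization: coverage of the entries containing 1s by rectangles. Each factor $l$ contributes the rectangle given by column $l$ of $A$ and row $l$ of $B$:
  $$\begin{pmatrix} 1&0&1&1&1 \\ 0&1&1&0&1 \\ 0&1&0&0&1 \\ 1&0&1&1&0 \end{pmatrix} = \begin{pmatrix} 1&1&0 \\ 0&1&1 \\ 0&0&1 \\ 1&0&0 \end{pmatrix} \circ \begin{pmatrix} 1&0&1&1&0 \\ 0&0&1&0&1 \\ 0&1&0&0&1 \end{pmatrix}$$
  $$= \begin{pmatrix} 1&0&1&1&0 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \\ 1&0&1&1&0 \end{pmatrix} \vee \begin{pmatrix} 0&0&1&0&1 \\ 0&0&1&0&1 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \end{pmatrix} \vee \begin{pmatrix} 0&0&0&0&0 \\ 0&1&0&0&1 \\ 0&1&0&0&1 \\ 0&0&0&0&0 \end{pmatrix}$$
  Belohlavek, R., Vychodil, V.: Discovery of optimal factors in binary data via a novel method of matrix decomposition, J. of Comp. and Sys. Sciences 76(1)(2010), 3-20.
  Belohlavek, R., Trnecka, M.: From-Below Approximations in Boolean Matrix Factorization: Geometry and New Algorithm, J. of Comp. and Sys. Sciences 81(8)(2015), 1678-1697.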
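
The rectangle view can be illustrated on the same example: each factor is the outer product of one column of A with one row of B, and the union of these rectangles gives I. This is a small sketch in plain Python; `rectangle` and `union` are illustrative helper names, not part of the slides.

```python
def rectangle(col, row):
    """Outer product of one column of A and one row of B (a rectangle of 1s)."""
    return [[c & r for r in row] for c in col]

def union(mats):
    """Entrywise maximum (Boolean OR) of equally sized 0/1 matrices."""
    return [[max(vals) for vals in zip(*rows)] for rows in zip(*mats)]

I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]
A = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
B = [[1, 0, 1, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 1, 0, 0, 1]]

# Factor l contributes the rectangle (column l of A) x (row l of B).
rects = [rectangle([a_row[l] for a_row in A], B[l]) for l in range(len(B))]
covered = union(rects)
print(covered == I)  # prints True: the three rectangles cover exactly the 1s of I
```

The rectangles may overlap (entries (1,3) and (2,5) in 1-based indexing are each covered twice here), which the Boolean OR absorbs.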

Explanation of data by factors
  How large a portion of the data is explained by the factors?
  Distance (error function): $E(C, D) = \|C - D\| = \sum_{i=1}^{n} \sum_{j=1}^{m} |C_{ij} - D_{ij}|$.
  Two components of $E$, $E_u$ and $E_o$ (non-symmetric): $E(I, A \circ B) = E_u(I, A \circ B) + E_o(I, A \circ B)$, where
  $E_u(I, A \circ B) = |\{\langle i, j \rangle ; I_{ij} = 1, (A \circ B)_{ij} = 0\}|$,
  $E_o(I, A \circ B) = |\{\langle i, j \rangle ; I_{ij} = 0, (A \circ B)_{ij} = 1\}|$.
  Adding factors: $E_u$ decreases, $E_o$ increases.
  Zero $E_o$: from-below factorization.
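
The two error components can be computed by counting mismatched entries. The sketch below is plain Python; the function name `errors` and the two test matrices `P_two` and `P_ones` are made up for illustration (`P_two` is what the first two factors of the 4 x 5 example cover, `P_ones` is an all-ones matrix).

```python
def errors(I, P):
    """Return (E_u, E_o): 1s of I missing in P, and 0s of I that are 1 in P."""
    pairs = [(x, y) for ri, rp in zip(I, P) for x, y in zip(ri, rp)]
    e_u = sum(1 for x, y in pairs if x == 1 and y == 0)  # undercovering
    e_o = sum(1 for x, y in pairs if x == 0 and y == 1)  # overcovering
    return e_u, e_o

I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]

# Hypothetical approximations of I, for illustration only.
P_two = [[1, 0, 1, 1, 1],
         [0, 0, 1, 0, 1],
         [0, 0, 0, 0, 0],
         [1, 0, 1, 1, 0]]
P_ones = [[1] * 5 for _ in range(4)]

print(errors(I, P_two))   # (3, 0): from-below, so E_o = 0
print(errors(I, P_ones))  # (0, 8): everything covered, eight 0s overcovered
```

The total error E is the sum of the two components, which equals the number of entries in which I and P differ.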

Two basic viewpoints on BMF
  Discrete Basis Problem
  Given $I \in \{0,1\}^{n \times m}$ and a positive integer $k$, find $A \in \{0,1\}^{n \times k}$ and $B \in \{0,1\}^{k \times m}$ that minimize $\|I - A \circ B\|$.
  Emphasizes the importance of the first few (presumably most important) factors.
  Miettinen, P., Mielikainen, T., Gionis, A., Das, G., Mannila, H.: The discrete basis problem, IEEE Transactions on Knowledge and Data Engineering 20(10)(2008), 1348-1362.
  Approximate Factorization Problem
  Given $I$ and a prescribed error $\varepsilon \ge 0$, find $A \in \{0,1\}^{n \times k}$ and $B \in \{0,1\}^{k \times m}$ with $k$ as small as possible such that $\|I - A \circ B\| \le \varepsilon$.
  Emphasizes the need to account for (and thus to explain) a prescribed (presumably reasonably large) portion of data.
  Belohlavek, R., Trnecka, M.: From-Below Approximations in Boolean Matrix Factorization: Geometry and New Algorithm, J. of Comp. and Sys. Sciences 81(8)(2015), 1678-1697.

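Optimal decomposition via formal concepts
  Formal context $\langle X, Y, I \rangle$: representation of $I$.
  Formal concept $\langle A, B \rangle$: $A \subseteq X$, $B \subseteq Y$, $A^{\uparrow} = B$ and $B^{\downarrow} = A$, where
  $A^{\uparrow} = \{y \in Y \mid \forall x \in A: \langle x, y \rangle \in I\}$, $B^{\downarrow} = \{x \in X \mid \forall y \in B: \langle x, y \rangle \in I\}$.
  The set of all formal concepts $\mathcal{B}(I)$ forms a concept lattice (a complete lattice).
  Formal concepts of $I$ are optimal factors of $I$.
  Belohlavek, R., Vychodil, V.: Discovery of optimal factors in binary data via a novel method of matrix decomposition, J. of Comp. and Sys. Sciences 76(1)(2010), 3-20.
  Decomposition via formal concepts: for a given set $\mathcal{F} = \{\langle C_1, D_1 \rangle, \langle C_2, D_2 \rangle, \ldots, \langle C_k, D_k \rangle\} \subseteq \mathcal{B}(I)$ of formal concepts, define the $n \times k$ and $k \times m$ matrices $A_{\mathcal{F}}$ and $B_{\mathcal{F}}$ by $(A_{\mathcal{F}})_{il} = (C_l)(i)$ and $(B_{\mathcal{F}})_{lj} = (D_l)(j)$.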
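
The concept-forming operators and the matrices built from a set of concepts can be sketched as follows. This is plain Python with 0-based indices; the names `up`, `down`, `is_concept`, and `factor_matrices` are illustrative choices, and the three concepts in `F` are read off the 4 x 5 example matrix used earlier in the talk.

```python
def up(I, objs):
    """C^up: attributes shared by all objects in objs."""
    return frozenset(j for j in range(len(I[0])) if all(I[i][j] for i in objs))

def down(I, attrs):
    """D^down: objects that have all attributes in attrs."""
    return frozenset(i for i in range(len(I)) if all(I[i][j] for j in attrs))

def is_concept(I, C, D):
    """<C, D> is a formal concept iff C^up = D and D^down = C."""
    return up(I, C) == frozenset(D) and down(I, D) == frozenset(C)

def factor_matrices(F, n, m):
    """(A_F)[i][l] = 1 iff i is in C_l; (B_F)[l][j] = 1 iff j is in D_l."""
    A = [[int(i in C) for (C, _) in F] for i in range(n)]
    B = [[int(j in D) for j in range(m)] for (_, D) in F]
    return A, B

I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]

# Three concepts (extent, intent) of I, 0-based.
F = [({0, 3}, {0, 2, 3}), ({0, 1}, {2, 4}), ({1, 2}, {1, 4})]
print(all(is_concept(I, C, D) for C, D in F))  # prints True

A_F, B_F = factor_matrices(F, len(I), len(I[0]))
product = [[max(min(A_F[i][l], B_F[l][j]) for l in range(len(F)))
            for j in range(len(I[0]))] for i in range(len(I))]
print(product == I)  # prints True: these concepts factorize I exactly
```

Column l of A_F is the characteristic vector of the extent C_l and row l of B_F is the characteristic vector of the intent D_l, which is exactly the rectangle picture of a factor.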

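Role of entries containing 1
  Some 1s are more important than others.
  Reformulation in terms of the labeled diagram of $\mathcal{B}(I)$: find the smallest subset $\mathcal{F}$ of nodes in the diagram such that whenever there exists a path from a node $\gamma(i)$ up to $\mu(j)$, some such path goes through some $c \in \mathcal{F}$.
  $\gamma(i) = \langle \{i\}^{\uparrow\downarrow}, \{i\}^{\uparrow} \rangle$ and $\mu(j) = \langle \{j\}^{\downarrow}, \{j\}^{\downarrow\uparrow} \rangle$.
  According to the Basic Theorem of FCA, $\gamma(i) \le \mu(j)$ iff $I_{ij} = 1$.
  The concepts $\langle C, D \rangle$ that cover $\langle i, j \rangle$ are just those in the interval $\mathcal{I}_{ij} = [\gamma(i), \mu(j)]$.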

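Example
  Context with objects 1-6 (rows) and attributes a-e (columns):
       a  b  c  d  e
    1  1  1  0  1  0
    2  1  0  0  1  1
    3  0  1  1  0  0
    4  0  0  0  1  0
    5  1  1  1  1  0
    6  1  1  0  0  1
  Nodes of the concept lattice, listed level by level from top to bottom as extent | intent:
    1,2,3,4,5,6 | (none)
    1,2,5,6 | a     1,2,4,5 | d     1,3,5,6 | b
    2,6 | a,e     1,2,5 | a,d     1,5,6 | a,b     3,5 | b,c
    2 | a,d,e     6 | a,b,e     1,5 | a,b,d
    5 | a,b,c,d
    (none) | a,b,c,d,e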
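
The concepts of this example context can be enumerated by brute force, closing every subset of objects. The sketch below is plain Python; `all_concepts` is an illustrative helper that is exponential in the number of objects and only meant for tiny contexts such as this one. It is not the algorithm used in the thesis.

```python
from itertools import combinations

def up(I, objs):
    """Attributes shared by all objects in objs."""
    return frozenset(j for j in range(len(I[0])) if all(I[i][j] for i in objs))

def down(I, attrs):
    """Objects that have all attributes in attrs."""
    return frozenset(i for i in range(len(I)) if all(I[i][j] for j in attrs))

def all_concepts(I):
    """Brute force: every concept is <C^updown, C^up> for some object set C."""
    found = set()
    for r in range(len(I) + 1):
        for objs in combinations(range(len(I)), r):
            intent = up(I, objs)
            found.add((down(I, intent), intent))
    return found

# Example context: rows are objects 1-6, columns are attributes a-e.
CONTEXT = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 0, 0, 1],
]

concepts = all_concepts(CONTEXT)
for extent, intent in sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[0]))):
    objs = ",".join(str(i + 1) for i in sorted(extent))
    attrs = ",".join("abcde"[j] for j in sorted(intent))
    print(f"{objs or '-'} | {attrs or '-'}")
```

For this context the enumeration yields 13 concepts, matching the nodes of the diagram.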

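Role of entries containing 1
  For $C \subseteq X$ and $D \subseteq Y$: $\mathcal{I}_{C,D} = [\gamma(C), \mu(D)] = \{\langle E, F \rangle \in \mathcal{B}(I) \mid \gamma(C) \le \langle E, F \rangle \le \mu(D)\}$, where $\gamma(C) = \langle C^{\uparrow\downarrow}, C^{\uparrow} \rangle$ and $\mu(D) = \langle D^{\downarrow}, D^{\downarrow\uparrow} \rangle$.
  Lemma
  (a) $\mathcal{I}_{C,D}$ is non-empty iff $C \times D \subseteq I$, i.e. iff $I_{ij} = 1$ for every $i \in C$ and $j \in D$. In particular, $\mathcal{I}_{ij}$ is non-empty iff $I_{ij} = 1$.
  (b) $\mathcal{I}_{C,D} = \{\langle E, F \rangle \in \mathcal{B}(I) \mid C^{\uparrow\downarrow} \subseteq E, D^{\downarrow\uparrow} \subseteq F\} = \{\langle E, F \rangle \in \mathcal{B}(I) \mid C \subseteq E, D \subseteq F\}$. In particular, $\mathcal{I}_{ij}$ is the set of all concepts that cover $\langle i, j \rangle$.
  (c) If $(A_{\mathcal{F}} \circ B_{\mathcal{F}})_{ij} = 1$, then $\mathcal{F}$ contains at least one concept in $\mathcal{I}_{ij}$.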

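Essential elements
  For $I \in \{0,1\}^{n \times m}$, denote by $\mathcal{E}(I)$ the $n \times m$ Boolean matrix given by
  $(\mathcal{E}(I))_{ij} = 1$ iff $\mathcal{I}_{ij}$ is non-empty and minimal w.r.t. $\subseteq$,
  where $\subseteq$ denotes set inclusion.
  $\mathcal{I}_{ij} \subseteq \mathcal{I}_{i'j'}$ iff $\gamma(i') \le \gamma(i)$ and $\mu(j) \le \mu(j')$ iff $\{i\}^{\uparrow} \subseteq \{i'\}^{\uparrow}$ and $\{j\}^{\downarrow} \subseteq \{j'\}^{\downarrow}$.
  $\mathcal{E}$ is easy to compute.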
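
One way to see why the matrix of essential elements is easy to compute: by the characterization above, comparing two intervals only requires comparing rows of I (the sets {i}^up) and columns of I (the sets {j}^down). The following plain-Python sketch is a direct, unoptimized reading of the definition, assuming minimality is taken among the non-empty intervals; the function name `essential` is illustrative and this is not the thesis implementation.

```python
def essential(I):
    """Matrix of essential elements, computed straight from the definition."""
    n, m = len(I), len(I[0])
    rows = [frozenset(j for j in range(m) if I[i][j]) for i in range(n)]  # {i}^up
    cols = [frozenset(i for i in range(n) if I[i][j]) for j in range(m)]  # {j}^down
    E = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if not I[i][j]:
                continue  # the interval for <i, j> is empty
            # Is there a non-empty interval for <p, q> strictly inside the one for <i, j>?
            smaller = any(
                I[p][q]
                and rows[p] <= rows[i] and cols[q] <= cols[j]
                and (rows[p] != rows[i] or cols[q] != cols[j])
                for p in range(n) for q in range(m)
            )
            E[i][j] = 0 if smaller else 1
    return E

I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]

for row in essential(I):
    print(row)
```

For the 4 x 5 example matrix, only 5 of the 12 entries containing 1 turn out to be essential under this reading.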

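GreEss algorithm: theory
  Factorizations of $I$ may be obtained from factorizations of $\mathcal{E}(I)$.
  Theorem
  Let $\mathcal{G} \subseteq \mathcal{B}(\mathcal{E}(I))$ be a set of factor concepts of $\mathcal{E}(I)$, i.e. $\mathcal{E}(I) = A_{\mathcal{G}} \circ B_{\mathcal{G}}$. Then every set $\mathcal{F} \subseteq \mathcal{B}(I)$ containing, for each $\langle C, D \rangle \in \mathcal{G}$, at least one concept from $\mathcal{I}_{C,D}$ is a set of factor concepts of $I$, i.e. $I = A_{\mathcal{F}} \circ B_{\mathcal{F}}$.
  May be generalized to arbitrary factorizations of $\mathcal{E}(I)$.
  Theorem
  For every Boolean matrix $I$ we have $\mathrm{rank}_B(I) \le \mathrm{rank}_B(\mathcal{E}(I))$.
  The estimation is not tight.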

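GreEss algorithm: basic idea
  GreEss algorithm
  1. Compute intervals in $\mathcal{B}(I)$.
  2. Search these intervals in a greedy manner for factor concepts of $I$.
  More precisely:
  Compute $\mathcal{E}(I)$.
  Start computing the set $\mathcal{G}$ of factors of $\mathcal{E}(I)$ (via the GreConD algorithm).
  Stop (computing $\mathcal{G}$) when the conservative estimations of the counterparts of $\langle C, D \rangle \in \mathcal{G}$ cover $I$, i.e. stop when $I = \bigcup_{\langle C, D \rangle \in \mathcal{G}} C^{\uparrow_I \downarrow_I} \times D^{\downarrow_I \uparrow_I}$.
  Select one concept per interval $\mathcal{I}_{C,D}$ for $\langle C, D \rangle \in \mathcal{G}$ in $\mathcal{B}(I)$ in a greedy manner.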
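
To give a feel for what selecting concepts "in a greedy manner" means, here is a deliberately simplified sketch in plain Python. It is neither GreEss nor GreConD: it enumerates all formal concepts of I by brute force (exponential, tiny inputs only) and repeatedly picks the concept that covers the most still-uncovered 1s. It does not use essential elements or intervals at all. The names `all_concepts` and `greedy_cover` are illustrative.

```python
from itertools import combinations

def up(I, objs):
    return frozenset(j for j in range(len(I[0])) if all(I[i][j] for i in objs))

def down(I, attrs):
    return frozenset(i for i in range(len(I)) if all(I[i][j] for j in attrs))

def all_concepts(I):
    """Brute force: close every subset of objects."""
    found = set()
    for r in range(len(I) + 1):
        for objs in combinations(range(len(I)), r):
            intent = up(I, objs)
            found.add((down(I, intent), intent))
    # Sorted so that ties in the greedy step are broken deterministically.
    return sorted(found, key=lambda c: (sorted(c[0]), sorted(c[1])))

def greedy_cover(I):
    """Pick concepts until every 1 of I is covered (from-below, so E_o = 0)."""
    uncovered = {(i, j) for i, row in enumerate(I) for j, v in enumerate(row) if v}
    concepts = all_concepts(I)
    chosen = []
    while uncovered:
        best = max(concepts,
                   key=lambda c: sum((i, j) in uncovered for i in c[0] for j in c[1]))
        chosen.append(best)
        uncovered -= {(i, j) for i in best[0] for j in best[1]}
    return chosen

I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]

for extent, intent in greedy_cover(I):
    print(sorted(extent), sorted(intent))
```

The loop always terminates because every entry containing 1 lies in at least one concept. On the 4 x 5 example it finds three factor concepts; GreEss restricts the search to the intervals obtained from the essential part of I instead of scanning every concept.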

GreEss vs. previous algorithms
  Previous (selected) algorithms: 8M, Tiling, Asso, GreConD, PaNDa, Hyper
  Experimental evaluation:
  Quality of decomposition on synthetic data (12)
  Quality of decomposition on real data (16)
  Capability to find the original factors
  Interpretation of obtained factors
  Coverage quality: for $A \in \{0,1\}^{n \times l}$ and $B \in \{0,1\}^{l \times m}$, $c(l) = 1 - E(I, A \circ B) / \|I\|$.
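
The coverage quality c(l) can be traced factor by factor. The sketch below, in plain Python, evaluates it on the 4 x 5 example factorization from the start of the talk, taking the first l columns of A and the first l rows of B; the function names are illustrative and l is assumed to be at least 1.

```python
def bool_product(A, B):
    """Boolean matrix product of 0/1 matrices given as lists of lists."""
    return [[max(min(row[l], B[l][j]) for l in range(len(B)))
             for j in range(len(B[0]))] for row in A]

def coverage(I, A, B, l):
    """c(l) = 1 - E(I, A_l o B_l) / ||I||, using only the first l >= 1 factors."""
    P = bool_product([row[:l] for row in A], B[:l])
    err = sum(abs(x - y) for ri, rp in zip(I, P) for x, y in zip(ri, rp))
    return 1 - err / sum(map(sum, I))

I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]
A = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
B = [[1, 0, 1, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 1, 0, 0, 1]]

print([coverage(I, A, B, l) for l in (1, 2, 3)])  # [0.5, 0.75, 1.0]
```

Plotting c(l) against l gives the kind of coverage curve used to compare the algorithms.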

Synthetic data
  Figure: Coverage quality of the first k factors; (a) Set C3, (b) Set C4. Axes: k (number of factors, 0 to 50) vs. coverage (0 to 1). Curves: 8M, Tiling, Asso, GreConD, PaNDa, Hyper, GreEss.

Real data (Mushroom)
  Number of factors needed for the prescribed coverage (100c%), dataset Mushroom:
  coverage  8M   Tiling  Asso  GreConD  PaNDa  Hyper  GreEss
  25%       3    3       2     3        2      9      2
  50%       5    7       6     7        NA     20     8
  75%       31   24      36    24       NA     38     26
  90%       85   47      NA    46       NA     57     47
  95%       NA   62      NA    62       NA     70     61
  100%      NA   120     NA    120      NA     123    105
  Table: Quality of decompositions (Mushroom)

Other experiments
  Ground truth
  dataset  8M     Tiling  Asso   GreConD  PaNDa  Hyper  GreEss
  C1       0.323  0.736   0.545  0.728    0.215  0.012  0.764
  C2       0.305  0.739   0.397  0.734    0.173  0.010  0.769
  C3       0.326  0.719   0.185  0.706    0.076  0.007  0.732
  C4       0.365  0.787   0.170  0.752    0.097  0.005  0.828
  Table: Similarity of original and computed factorization
  Interpretation: poor results for 8M, Asso, PaNDa

Conclusion
  Geometry of factorization: coverage
  New theoretical results: essential elements
  New algorithm: GreEss
  Experimental evaluation: GreEss outperforms the selected algorithms


H. Řezanková
  Example, Lemma, Observation, Remark, Theorem: unique numbering
  Matrix notation: we understand matrices as relations
  Unified notation vs. original notation: each algorithm has its own logic and terminology
  Tiling algorithm: 2004
  Table 6.1
  dataset  k   dens A  dens B  dens I  avg ||E(I)|| / ||I||
  Set A1   20  0.11    0.05    0.1     0.0434±0.0043
  Set A2   20  0.11    0.10    0.2     0.0164±0.0019
  Set A3   20  0.15    0.12    0.3     0.0048±0.0006
  Set A4   20  0.17    0.15    0.4     0.0021±0.0005
  The value after the ± symbol is the sample standard deviation (page 68)
  Aim of the thesis

J. Rauch
  Outrata, J., Trnecka, M.: Evaluating Association Rules in Boolean Matrix Factorization, Workshop on Computational Intelligence and Data Mining, WCIDM 2016, In Proc. of the 16th ITAT conference, CEUR Workshop Proceedings Vol. 1649, pp. 147-154.
  Asso algorithm: confidence (parameter $\tau$)
  Inspired by the GUHA method
  Association rule (on the set of attributes of matrix $I$): $i \Rightarrow j$
  Four-fold table:
  I         j                               $\neg j$
  i         $a = fr(i \wedge j)$            $b = fr(i \wedge \neg j)$
  $\neg i$  $c = fr(\neg i \wedge j)$       $d = fr(\neg i \wedge \neg j)$
  Generalized quantifiers (founded implication, double founded implication, and others)
  GreConD algorithm: attribute implications
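
The four-fold table of a pair of attributes can be computed by counting rows. The sketch below is plain Python and uses the 4 x 5 example matrix from the start of the talk; the function names are illustrative. The two ratios follow the usual GUHA definitions (a/(a+b) for founded implication, which is the confidence, and a/(a+b+c) for double founded implication); how exactly the thesis plugs them into Asso is not shown on the slide, so that part is left out.

```python
def four_fold(I, i, j):
    """Frequencies (a, b, c, d) of the four-fold table for attributes i and j."""
    a = sum(1 for row in I if row[i] and row[j])          # i and j
    b = sum(1 for row in I if row[i] and not row[j])      # i and not j
    c = sum(1 for row in I if not row[i] and row[j])      # not i and j
    d = sum(1 for row in I if not row[i] and not row[j])  # not i and not j
    return a, b, c, d

def founded_implication(a, b, c, d):
    """Confidence of i => j: a / (a + b); 0.0 if i never occurs."""
    return a / (a + b) if a + b else 0.0

def double_founded_implication(a, b, c, d):
    """a / (a + b + c); 0.0 if neither i nor j ever occurs."""
    return a / (a + b + c) if a + b + c else 0.0

I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]

table = four_fold(I, 1, 4)  # columns are 0-based: second and fifth attribute
print(table)                               # (2, 0, 1, 1)
print(founded_implication(*table))         # 1.0
print(double_founded_implication(*table))
```

A threshold such as the parameter τ of Asso would then be compared against one of these ratios.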

Figure: Coverage error for C1. Axes: number of factors (0 to 40) vs. coverage (0 to 1).
Figure: Overcoverage error for C1. Axes: number of factors (0 to 40) vs. overcoverage (0 to 2).
Curves in both figures: founded implication, double founded implication, founded equivalence, negative Jaccard distance, E-equivalence.
