Decompositions of Matrices with Relational Data: Foundations and Algorithms

Martin Trnečka
Department of Computer Science, Palacký University Olomouc
PhD dissertation defense, Feb 7, 2017

Outline
  My research
  Boolean matrix factorization (BMF)
  Geometry of BMF
  Basic viewpoints
  Formal concept analysis and BMF
  Role of entries containing 1
  Essential elements
  GreEss algorithm
  Experimental results

My research
  Boolean matrix factorization (BMF)
  Algorithms (19 previous, 6 new)
  Experimental evaluation
  Applications
  Basic level of concepts
  Identify important knowledge in data
  Citations (since 2013):
    Web of Science: 4 (h-index: 2)
    Scopus: 24 (h-index: 3)
    Google Scholar: 66 (h-index: 5)

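Boolean matrix factorization (BMF)
  Method for the analysis of Boolean data.
  General aim: for a given matrix $I \in \{0,1\}^{n \times m}$, find matrices $A \in \{0,1\}^{n \times k}$ and $B \in \{0,1\}^{k \times m}$ for which $I$ (approximately) equals $A \circ B$.
  $\circ$ is the Boolean matrix product:
  $$(A \circ B)_{ij} = \max_{l=1}^{k} \min(A_{il}, B_{lj}).$$
  Example:
  $$\begin{pmatrix} 1&0&1&1&1 \\ 0&1&1&0&1 \\ 0&1&0&0&1 \\ 1&0&1&1&0 \end{pmatrix} = \begin{pmatrix} 1&1&0 \\ 0&1&1 \\ 0&0&1 \\ 1&0&0 \end{pmatrix} \circ \begin{pmatrix} 1&0&1&1&0 \\ 0&0&1&0&1 \\ 0&1&0&0&1 \end{pmatrix}$$
  Discovery of $k$ factors that exactly or approximately explain the data.
  Factors = interesting patterns (rectangles) in data.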
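The Boolean matrix product on this slide can be checked on the example with a few lines of Python. This is a minimal sketch; the function name `boolean_product` and the list-of-lists representation are illustrative choices, not part of the original slides.

```python
def boolean_product(A, B):
    """Boolean product: (A o B)[i][j] = max over l of min(A[i][l], B[l][j])."""
    k = len(B)          # number of factors
    m = len(B[0])       # number of columns of the result
    return [[max(min(row[l], B[l][j]) for l in range(k)) for j in range(m)]
            for row in A]

# Example matrices taken from the slide.
I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]
A = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
B = [[1, 0, 1, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 1, 0, 0, 1]]

print(boolean_product(A, B) == I)
```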

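Geometry of BMF
  Geometry of factorization: coverage of the entries containing 1s by rectangles.
  Each factor $l$ contributes one rectangle, the product of column $l$ of $A$ and row $l$ of $B$; the matrix $I = A \circ B$ is the Boolean sum (entrywise maximum) of these rectangles:
  $$\begin{pmatrix} 1&0&1&1&1 \\ 0&1&1&0&1 \\ 0&1&0&0&1 \\ 1&0&1&1&0 \end{pmatrix} = \begin{pmatrix} 1&0&1&1&0 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \\ 1&0&1&1&0 \end{pmatrix} \vee \begin{pmatrix} 0&0&1&0&1 \\ 0&0&1&0&1 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \end{pmatrix} \vee \begin{pmatrix} 0&0&0&0&0 \\ 0&1&0&0&1 \\ 0&1&0&0&1 \\ 0&0&0&0&0 \end{pmatrix}$$
  Belohlavek, R., Vychodil, V.: Discovery of optimal factors in binary data via a novel method of matrix decomposition, J. of Computer and System Sciences 76(1) (2010), 3-20.
  Belohlavek, R., Trnecka, M.: From-Below Approximations in Boolean Matrix Factorization: Geometry and New Algorithm, J. of Computer and System Sciences 81(8) (2015), 1678-1697.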
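The rectangle view can be made concrete with a short Python sketch that builds one rectangle per factor and checks that their entrywise maximum gives back the example matrix. The helper names `rectangle` and `boolean_sum` are hypothetical, introduced only for this illustration.

```python
def rectangle(A, B, l):
    """Rectangle of factor l: product of column l of A and row l of B."""
    return [[row[l] & b for b in B[l]] for row in A]

def boolean_sum(matrices):
    """Entrywise maximum of equally sized 0/1 matrices."""
    return [[max(vals) for vals in zip(*rows)] for rows in zip(*matrices)]

# Example factorization from the BMF slide.
I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]
A = [[1, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 0]]
B = [[1, 0, 1, 1, 0], [0, 0, 1, 0, 1], [0, 1, 0, 0, 1]]

rects = [rectangle(A, B, l) for l in range(len(B))]
print(boolean_sum(rects) == I)
```

Rectangles may overlap: in this example the entry in row 1, column 3 is covered by both the first and the second factor.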

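Explanation of data by factors
  How large a portion of the data is explained by the factors?
  Distance (error function):
  $$E(C, D) = \|C - D\| = \sum_{i=1}^{n} \sum_{j=1}^{m} |C_{ij} - D_{ij}|.$$
  Two components of $E$, namely $E_u$ (uncovered 1s) and $E_o$ (overcovered 0s); they are not symmetric:
  $$E(I, A \circ B) = E_u(I, A \circ B) + E_o(I, A \circ B),$$
  where
  $$E_u(I, A \circ B) = |\{\langle i, j \rangle ;\ I_{ij} = 1,\ (A \circ B)_{ij} = 0\}|,$$
  $$E_o(I, A \circ B) = |\{\langle i, j \rangle ;\ I_{ij} = 0,\ (A \circ B)_{ij} = 1\}|.$$
  Adding factors: $E_u$ decreases, $E_o$ increases.
  Zero $E_o$: from-below factorization.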
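The two error components can be computed directly from their definitions. The following is a minimal sketch on the example matrix from the BMF slide; the function names are illustrative. Using only the first two of the three example factors leaves three 1s uncovered and introduces no overcovering, so the truncated factorization is from-below.

```python
def boolean_product(A, B):
    """Boolean product: (A o B)[i][j] = max over l of min(A[i][l], B[l][j])."""
    return [[max(min(row[l], B[l][j]) for l in range(len(B)))
             for j in range(len(B[0]))] for row in A]

def errors(I, A, B):
    """Return (E_u, E_o): uncovered 1s and overcovered 0s of I w.r.t. A o B."""
    P = boolean_product(A, B)
    pairs = [(x, y) for ri, rp in zip(I, P) for x, y in zip(ri, rp)]
    e_u = sum(1 for x, y in pairs if x == 1 and y == 0)
    e_o = sum(1 for x, y in pairs if x == 0 and y == 1)
    return e_u, e_o

I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]
A = [[1, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 0]]
B = [[1, 0, 1, 1, 0], [0, 0, 1, 0, 1], [0, 1, 0, 0, 1]]

# Keep only the first two factors (first two columns of A, first two rows of B).
A2 = [row[:2] for row in A]
B2 = B[:2]
print(errors(I, A2, B2))   # (E_u, E_o) for the truncated factorization
print(errors(I, A, B))     # (E_u, E_o) for the exact factorization
```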

Two basic viewpoints on BMF
  Discrete Basis Problem
    Given $I \in \{0,1\}^{n \times m}$ and a positive integer $k$, find $A \in \{0,1\}^{n \times k}$ and $B \in \{0,1\}^{k \times m}$ that minimize $\|I - A \circ B\|$.
    Emphasizes the importance of the first few (presumably most important) factors.
    Miettinen, P., Mielikainen, T., Gionis, A., Das, G., Mannila, H.: The discrete basis problem, IEEE Transactions on Knowledge and Data Engineering 20(10) (2008), 1348-1362.
  Approximate Factorization Problem
    Given $I$ and a prescribed error $\varepsilon \ge 0$, find $A \in \{0,1\}^{n \times k}$ and $B \in \{0,1\}^{k \times m}$ with $k$ as small as possible such that $\|I - A \circ B\| \le \varepsilon$.
    Emphasizes the need to account for (and thus to explain) a prescribed (presumably reasonably large) portion of data.
    Belohlavek, R., Trnecka, M.: From-Below Approximations in Boolean Matrix Factorization: Geometry and New Algorithm, J. of Computer and System Sciences 81(8) (2015), 1678-1697.

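Optimal decomposition via formal concepts
  Formal context $\langle X, Y, I \rangle$: a representation of $I$.
  Formal concept $\langle A, B \rangle$: $A \subseteq X$, $B \subseteq Y$, $A^{\uparrow} = B$ and $B^{\downarrow} = A$, where
  $$A^{\uparrow} = \{y \in Y \mid \forall x \in A : \langle x, y \rangle \in I\}, \qquad B^{\downarrow} = \{x \in X \mid \forall y \in B : \langle x, y \rangle \in I\}.$$
  The set of all formal concepts $\mathcal{B}(I)$ forms a concept lattice (a complete lattice).
  Formal concepts of $I$ are optimal factors of $I$.
  Belohlavek, R., Vychodil, V.: Discovery of optimal factors in binary data via a novel method of matrix decomposition, J. of Computer and System Sciences 76(1) (2010), 3-20.
  Decomposition via formal concepts: for a given set $F = \{\langle C_1, D_1 \rangle, \langle C_2, D_2 \rangle, \dots, \langle C_k, D_k \rangle\} \subseteq \mathcal{B}(I)$ of formal concepts, define the $n \times k$ and $k \times m$ matrices $A_F$ and $B_F$ by
  $$(A_F)_{il} = (C_l)(i) \quad \text{and} \quad (B_F)_{lj} = (D_l)(j).$$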
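The concept-forming operators and the matrices $A_F$, $B_F$ can be sketched in Python as follows. The names `up`, `down`, `is_concept` and `factor_matrices` are hypothetical, rows and columns are indexed from 0, and the three concepts are the factors of the example matrix from the BMF slide.

```python
def up(objects, I):
    """Attributes shared by all objects in `objects` (the arrow-up operator)."""
    return {j for j in range(len(I[0])) if all(I[i][j] for i in objects)}

def down(attributes, I):
    """Objects having all attributes in `attributes` (the arrow-down operator)."""
    return {i for i in range(len(I)) if all(I[i][j] for j in attributes)}

def is_concept(C, D, I):
    """<C, D> is a formal concept iff C^up = D and D^down = C."""
    return up(C, I) == set(D) and down(D, I) == set(C)

def factor_matrices(F, n, m):
    """Characteristic vectors of extents as columns of A_F, of intents as rows of B_F."""
    A_F = [[1 if i in C else 0 for (C, _) in F] for i in range(n)]
    B_F = [[1 if j in D else 0 for j in range(m)] for (_, D) in F]
    return A_F, B_F

I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]

# Three concepts of I, written as (extent, intent) with 0-based indices.
F = [({0, 3}, {0, 2, 3}), ({0, 1}, {2, 4}), ({1, 2}, {1, 4})]
print(all(is_concept(C, D, I) for C, D in F))
A_F, B_F = factor_matrices(F, len(I), len(I[0]))
print(A_F)
print(B_F)
```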

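Role of entries containing 1
  Some 1s are more important than others.
  Reformulation in terms of the labeled diagram of $\mathcal{B}(I)$: find the smallest subset $F$ of nodes in the diagram such that whenever there exists a path from a node $\gamma(i)$ up to $\mu(j)$, some such path goes through some $c \in F$.
  $\gamma(i) = \langle \{i\}^{\uparrow\downarrow}, \{i\}^{\uparrow} \rangle$ and $\mu(j) = \langle \{j\}^{\downarrow}, \{j\}^{\downarrow\uparrow} \rangle$.
  According to the Basic Theorem of FCA, $\gamma(i) \le \mu(j)$ iff $I_{ij} = 1$.
  The concepts $\langle C, D \rangle$ that cover $\langle i, j \rangle$ are just those in the interval $\mathcal{I}_{ij} = [\gamma(i), \mu(j)]$.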

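Example
  Formal context (objects 1 to 6, attributes a to e):

        a  b  c  d  e
    1   1  1  0  1  0
    2   1  0  0  1  1
    3   0  1  1  0  0
    4   0  0  0  1  0
    5   1  1  1  1  0
    6   1  1  0  0  1

  Concept lattice diagram, nodes given as extent and intent:
    {1,2,3,4,5,6}, {}
    {1,2,5,6}, {a}
    {1,2,4,5}, {d}
    {1,3,5,6}, {b}
    {1,2,5}, {a,d}
    {2,6}, {a,e}
    {1,5,6}, {a,b}
    {3,5}, {b,c}
    {2}, {a,d,e}
    {6}, {a,b,e}
    {1,5}, {a,b,d}
    {5}, {a,b,c,d}
    {}, {a,b,c,d,e}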

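Role of entries containing 1
  For $C \subseteq X$ and $D \subseteq Y$:
  $$\mathcal{I}_{C,D} = [\gamma(C), \mu(D)] = \{\langle E, F \rangle \in \mathcal{B}(I) \mid \gamma(C) \le \langle E, F \rangle \le \mu(D)\},$$
  where $\gamma(C) = \langle C^{\uparrow\downarrow}, C^{\uparrow} \rangle$ and $\mu(D) = \langle D^{\downarrow}, D^{\downarrow\uparrow} \rangle$.
  Lemma
  (a) $\mathcal{I}_{C,D}$ is non-empty iff $C \times D \subseteq I$, i.e. iff $I_{ij} = 1$ for every $i \in C$ and $j \in D$. In particular, $\mathcal{I}_{ij}$ is non-empty iff $I_{ij} = 1$.
  (b) $\mathcal{I}_{C,D} = \{\langle E, F \rangle \in \mathcal{B}(I) \mid C \subseteq E,\ D \subseteq F\} = \{\langle E, F \rangle \in \mathcal{B}(I) \mid C^{\uparrow\downarrow} \subseteq E,\ D^{\downarrow\uparrow} \subseteq F\}$. In particular, $\mathcal{I}_{ij}$ is the set of all concepts that cover $\langle i, j \rangle$.
  (c) If $(A_F \circ B_F)_{ij} = 1$, then $F$ contains at least one concept in $\mathcal{I}_{ij}$.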

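Essential elements
  For $I \in \{0,1\}^{n \times m}$ denote by $\mathcal{E}(I)$ the $n \times m$ Boolean matrix given by
  $$(\mathcal{E}(I))_{ij} = 1 \ \text{ iff } \ \mathcal{I}_{ij} \text{ is non-empty and minimal w.r.t. } \subseteq,$$
  where $\subseteq$ denotes set inclusion.
  $\mathcal{I}_{ij} \subseteq \mathcal{I}_{i'j'}$ iff $\gamma(i') \le \gamma(i)$ and $\mu(j) \le \mu(j')$ iff $\{i\}^{\uparrow} \subseteq \{i'\}^{\uparrow}$ and $\{j\}^{\downarrow} \subseteq \{j'\}^{\downarrow}$.
  $\mathcal{E}$ is easy to compute.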
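One way to compute the matrix of essential elements is sketched below. It assumes a row and column characterization that follows from the inclusion condition above but is not stated on the slide: an entry in row i and column j is essential iff it contains 1, no row properly contained in row i has a 1 in column j, and no column properly contained in column j has a 1 in row i. The function name `essential` is hypothetical, and the input is the example matrix from the BMF slide.

```python
def essential(I):
    """Matrix of essential elements of a 0/1 matrix I (list of lists)."""
    n, m = len(I), len(I[0])
    rows = [{j for j in range(m) if I[i][j]} for i in range(n)]  # row i as a set of columns
    cols = [{i for i in range(n) if I[i][j]} for j in range(m)]  # column j as a set of rows
    E = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in rows[i]:
            # No row with a 1 in column j is a proper subset of row i.
            row_minimal = not any(rows[p] < rows[i] for p in cols[j])
            # No column with a 1 in row i is a proper subset of column j.
            col_minimal = not any(cols[q] < cols[j] for q in rows[i])
            E[i][j] = int(row_minimal and col_minimal)
    return E

I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]

for row in essential(I):
    print(row)
```

On this example only 5 of the 12 entries containing 1 are essential.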

GreEss algorithm: theory
  Factorizations of $I$ may be obtained from factorizations of $\mathcal{E}(I)$.
  Theorem
    Let $G \subseteq \mathcal{B}(\mathcal{E}(I))$ be a set of factor concepts of $\mathcal{E}(I)$, i.e. $\mathcal{E}(I) = A_G \circ B_G$. Then every set $F \subseteq \mathcal{B}(I)$ containing for each $\langle C, D \rangle \in G$ at least one concept from $\mathcal{I}_{C,D}$ is a set of factor concepts of $I$, i.e. $I = A_F \circ B_F$.
  The theorem may be generalized to arbitrary factorizations of $\mathcal{E}(I)$.
  Theorem
    For every Boolean matrix $I$ we have $\mathrm{rank}_B(I) \le \mathrm{rank}_B(\mathcal{E}(I))$.
  The estimate is not tight.

GreEss algorithm: basic idea
  GreEss algorithm
    1. Compute intervals in $\mathcal{B}(I)$.
    2. Search these intervals in a greedy manner for factor concepts of $I$.
  More precisely:
    Compute $\mathcal{E}(I)$.
    Start computing a set $G$ of factors of $\mathcal{E}(I)$ (via the GreConD algorithm).
    Stop computing $G$ when the conservative estimations of the counterparts of $\langle C, D \rangle \in G$ cover $I$, i.e. stop when
    $$I = \bigcup_{\langle C, D \rangle \in G} C^{\uparrow_I \downarrow_I} \times D^{\downarrow_I \uparrow_I}.$$
    Select one concept per interval $\mathcal{I}_{C,D}$ for $\langle C, D \rangle \in G$ in $\mathcal{B}(I)$ in a greedy manner.
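The stopping condition can be illustrated with a small Python sketch. It covers only the coverage test, not GreEss itself: GreConD and the greedy selection inside the intervals are not implemented, and the set `G` below is written out by hand as four concepts of the essential matrix of the example from the BMF slide (0-based indices). All function names are hypothetical.

```python
def up(objects, I):
    """Attributes shared by all objects in `objects`, computed in I."""
    return {j for j in range(len(I[0])) if all(I[i][j] for i in objects)}

def down(attributes, I):
    """Objects having all attributes in `attributes`, computed in I."""
    return {i for i in range(len(I)) if all(I[i][j] for j in attributes)}

def conservative_rectangle(C, D, I):
    """Closure of C as an extent and closure of D as an intent, both in I."""
    return down(up(C, I), I), up(down(D, I), I)

def covers(G, I):
    """True iff the conservative rectangles of all <C, D> in G cover exactly the 1s of I."""
    covered = set()
    for C, D in G:
        extent, intent = conservative_rectangle(C, D, I)
        covered |= {(i, j) for i in extent for j in intent}
    ones = {(i, j) for i, row in enumerate(I) for j, v in enumerate(row) if v}
    return covered == ones

I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]

# Hand-picked concepts of the essential matrix of I (an assumption for this example).
G = [({0}, {4}), ({1}, {2}), ({2}, {1}), ({3}, {0, 3})]
print(covers(G[:3], I))  # the first three rectangles do not yet cover I
print(covers(G, I))      # all four rectangles do
```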

GreEss vs. previous algorithms
  Previous (selected) algorithms: 8M, Tiling, Asso, GreConD, PaNDa, Hyper
  Experimental evaluation:
    Quality of decomposition on synthetic data (12)
    Quality of decomposition on real data (16)
    Capability to find the original factors
    Interpretation of obtained factors
  Coverage quality: for $A \in \{0,1\}^{n \times l}$ and $B \in \{0,1\}^{l \times m}$,
  $$c(l) = 1 - E(I, A \circ B) / \|I\|.$$
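The coverage quality can be computed for the first l factors of any factorization. The following minimal sketch evaluates it on the example factorization from the BMF slide; the function names are illustrative and the numbers are not related to the experiments reported in the talk.

```python
def boolean_product(A, B):
    """Boolean product: (A o B)[i][j] = max over l of min(A[i][l], B[l][j])."""
    return [[max(min(row[l], B[l][j]) for l in range(len(B)))
             for j in range(len(B[0]))] for row in A]

def coverage_curve(I, A, B):
    """Return [c(1), ..., c(k)], where c(l) = 1 - E(I, A_l o B_l) / ||I||."""
    ones = sum(map(sum, I))                      # ||I||, the number of 1s in I
    curve = []
    for l in range(1, len(B) + 1):
        A_l = [row[:l] for row in A]             # first l columns of A
        P = boolean_product(A_l, B[:l])          # first l rows of B
        error = sum(abs(x - y) for ri, rp in zip(I, P) for x, y in zip(ri, rp))
        curve.append(1 - error / ones)
    return curve

I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]
A = [[1, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 0]]
B = [[1, 0, 1, 1, 0], [0, 0, 1, 0, 1], [0, 1, 0, 0, 1]]

print(coverage_curve(I, A, B))
```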

Synthetic data
  Figure: Coverage quality of the first k factors, (a) Set C3 and (b) Set C4. Horizontal axis: k (number of factors), 0 to 50; vertical axis: coverage, 0 to 1. Algorithms compared: 8M, Tiling, Asso, GreConD, PaNDa, Hyper, GreEss.

Real data (Mushroom)
  Number of factors needed for the prescribed coverage c (given as 100c%):

    dataset   coverage   8M   Tiling   Asso   GreConD   PaNDa   Hyper   GreEss
    Mushroom  25%        3    3        2      3         2       9       2
              50%        5    7        6      7         NA      20      8
              75%        31   24       36     24        NA      38      26
              90%        85   47       NA     46        NA      57      47
              95%        NA   62       NA     62        NA      70      61
              100%       NA   120      NA     120       NA      123     105

  Table: Quality of decompositions (Mushroom)

Other experiments
  Ground truth

    dataset   8M      Tiling   Asso    GreConD   PaNDa   Hyper   GreEss
    C1        0.323   0.736    0.545   0.728     0.215   0.012   0.764
    C2        0.305   0.739    0.397   0.734     0.173   0.010   0.769
    C3        0.326   0.719    0.185   0.706     0.076   0.007   0.732
    C4        0.365   0.787    0.170   0.752     0.097   0.005   0.828

  Table: Similarity of original and computed factorization
  Interpretation: poor results for 8M, Asso, and PaNDa

Conclusion
  Geometry of factorization: coverage
  New theoretical results: essential elements
  New algorithm: GreEss
  Experimental evaluation: GreEss outperforms the selected algorithms


H. Řezanková
  Example, Lemma, Observation, Remark, Theorem: unique numbering
  Notation for matrices: we understand them as relations
  Unambiguous notation vs. original notation: every algorithm has its own logic and terminology
  Tiling algorithm: 2004
  Table 6.1

    dataset   k    dens A   dens B   dens I   avg ||E(I)|| / ||I||
    Set A1    20   0.11     0.05     0.1      0.0434±0.0043
    Set A2    20   0.11     0.10     0.2      0.0164±0.0019
    Set A3    20   0.15     0.12     0.3      0.0048±0.0006
    Set A4    20   0.17     0.15     0.4      0.0021±0.0005

  The value after the ± symbol is the sample standard deviation (page 68)
  Aim of the thesis

J. Rauch
  Outrata, J., Trnecka, M.: Evaluating Association Rules in Boolean Matrix Factorization, Workshop on Computational Intelligence and Data Mining, WCIDM 2016, In Proc. of the 16th ITAT conference, CEUR Workshop Proceedings Vol. 1649, pp. 147-154.
  Asso algorithm: confidence (parameter τ)
  Inspired by the GUHA method
  Association rule (on the set of attributes of matrix $I$): $i \Rightarrow j$
  Four-fold table:

    I          j                                 ¬j
    i          a = fr(i ∧ j)                     b = fr(i ∧ ¬j)
    ¬i         c = fr(¬i ∧ j)                    d = fr(¬i ∧ ¬j)

  Generalized quantifiers (founded implication, double founded implication, and others)
  GreConD algorithm: attribute implications
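The four-fold table for a pair of attributes can be counted directly from the matrix. The sketch below assumes that fr denotes the number of rows (objects) satisfying the condition and uses the usual confidence a / (a + b) of a rule from i to j; the thresholds used by the generalized quantifiers are not shown on the slide and are left out. The function names are hypothetical and the input is the example matrix from the BMF slide, with 0-based column indices.

```python
def fourfold(I, i, j):
    """Four-fold table (a, b, c, d) for attributes (columns) i and j of I."""
    a = sum(1 for row in I if row[i] and row[j])          # i and j
    b = sum(1 for row in I if row[i] and not row[j])      # i and not j
    c = sum(1 for row in I if not row[i] and row[j])      # not i and j
    d = sum(1 for row in I if not row[i] and not row[j])  # not i and not j
    return a, b, c, d

def confidence(I, i, j):
    """Confidence of the rule from i to j: a / (a + b), or 0.0 if i never occurs."""
    a, b, _, _ = fourfold(I, i, j)
    return a / (a + b) if a + b else 0.0

I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]

print(fourfold(I, 0, 2))
print(confidence(I, 0, 2))
```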

Figure: Coverage error for C1. Horizontal axis: number of factors, 0 to 40; vertical axis: coverage, 0 to 1. Quantifiers compared: founded implication, double founded implication, founded equivalence, negative Jaccard distance, E equivalence.
Figure: Overcoverage error for C1. Horizontal axis: number of factors, 0 to 40; vertical axis: overcoverage, 0 to 2. Quantifiers compared: founded implication, double founded implication, founded equivalence, negative Jaccard distance, E equivalence.
