Similarity Search
The Binary Branch Distance
Nikolaus Augsten
nikolaus.augsten@sbg.ac.at
Dept. of Computer Sciences, University of Salzburg
http://dbresearch.uni-salzburg.at
Version January 11, 2017
Winter Semester 2016/2017
Augsten (Univ. Salzburg), Similarity Search, Winter Semester 2016/2017

Outline
1 Binary Branch Distance
   Binary Representation of a Tree
   Binary Branches
   Lower Bound for the Edit Distance
   Complexity

Binary Branch Distance: Binary Representation of a Tree

Binary Tree
In a binary tree each node has at most two children; left child and right child are distinguished: a node can have a right child without having a left child.
Notation: $T_B = (N, E_l, E_r)$
   $T_B$ denotes a binary tree
   $N$ are the nodes of the binary tree
   $E_l$ and $E_r$ are the edges to the left and right children, respectively
Full binary tree: a binary tree in which each node has exactly zero or two children.

Example: Binary Tree
Two different binary trees, $T_B = (N, E_l, E_r)$:
   $T_{B1} = (\{a, b, c, d, e, f, g\}, \{(\cdot,\cdot), (\cdot,\cdot), (\cdot,\cdot), (\cdot, f)\}, \{(\cdot,\cdot), (\cdot, g)\})$
   $T_{B2} = (\{a, b, c, d, e, f, g\}, \{(\cdot,\cdot), (\cdot,\cdot), (\cdot, f)\}, \{(\cdot,\cdot), (\cdot,\cdot), (\cdot, g)\})$
[Figure: the binary trees $T_{B1}$ and $T_{B2}$]
A full binary tree:
[Figure: a full binary tree]

Binary Representation of a Tree
Binary tree transformation:
   (i) link all neighboring siblings in a tree with edges
   (ii) delete all parent-child edges except the edge to the first child
The transformation maintains
   label information
   structure information
The original tree can be reconstructed from the binary tree:
   a left edge represents a parent-child relationship in the original tree
   a right edge represents a right-sibling relationship in the original tree

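The two transformation steps, and the reconstruction of the original tree, can be sketched as follows. This is a minimal illustration, not code from the lecture: the types `Tree` and `BinNode` and the functions `to_binary` and `from_binary` are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class Tree:
    """Ordered, labeled tree (hypothetical helper type)."""
    label: str
    children: List["Tree"] = field(default_factory=list)


@dataclass
class BinNode:
    """Node of the binary representation (hypothetical helper type)."""
    label: str
    left: Optional["BinNode"] = None   # first child in the original tree
    right: Optional["BinNode"] = None  # right sibling in the original tree


def to_binary(node: Tree, right_siblings: Sequence[Tree] = ()) -> BinNode:
    """Keep only the edge to the first child and link neighboring siblings."""
    b = BinNode(node.label)
    if node.children:
        b.left = to_binary(node.children[0], node.children[1:])
    if right_siblings:
        b.right = to_binary(right_siblings[0], right_siblings[1:])
    return b


def from_binary(b: BinNode) -> Tree:
    """Rebuild the original tree: left edge = first child, right edges = siblings."""
    tree = Tree(b.label)
    child = b.left
    while child is not None:
        tree.children.append(from_binary(child))
        child = child.right
    return tree


t = Tree("a", [Tree("b", [Tree("d")]), Tree("c")])
assert from_binary(to_binary(t)) == t  # no information is lost
```

The round trip in the last line illustrates why the transformation is lossless for ordered, labeled trees.
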
Example: Binary Tree Transformation
Represent tree $T$ as a binary tree:
[Figure: tree $T$ and the binary representation of $T$]

Normalized Binary Tree Representation
We extend the binary tree with null nodes as follows:
   a null node for each missing left child of a non-null node
   a null node for each missing right child of a non-null node
Note: Leaf nodes get two null children.
The resulting normalized binary representation
   is a full binary tree
   all non-null nodes have two children
   all leaves are null nodes (and all null nodes are leaves)

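A minimal sketch of the normalization, assuming a hypothetical `BinNode` type in which a label of `None` marks a null node; `normalize` and `count_nodes` are illustrative names, not part of the lecture material.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class BinNode:
    label: Optional[str]               # None marks a null node (assumption)
    left: Optional["BinNode"] = None
    right: Optional["BinNode"] = None


def normalize(node: Optional[BinNode]) -> BinNode:
    """Return a copy in which every missing child of a non-null node is a null node."""
    if node is None:
        return BinNode(None)           # null nodes get no children and stay leaves
    return BinNode(node.label, normalize(node.left), normalize(node.right))


def count_nodes(node: BinNode) -> Tuple[int, int]:
    """Return (number of non-null nodes, number of null nodes) of a normalized tree."""
    if node.label is None:
        return (0, 1)
    left = count_nodes(node.left)
    right = count_nodes(node.right)
    return (1 + left[0] + right[0], left[1] + right[1])


# Binary tree a -(left)-> b -(right)-> c
b_t = normalize(BinNode("a", BinNode("b", None, BinNode("c"))))
print(count_nodes(b_t))  # (3, 4)
```

Because the result is a full binary tree whose inner nodes are exactly the non-null nodes, a tree with k non-null nodes receives k + 1 null nodes.
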
Example: Normalized Binary Tree
Transforming $T$ to the normalized binary tree $B(T)$:
[Figure: $T$ and $B(T)$]

Binary Branch Distance: Binary Branches

Binary Branch
A binary branch $BiB(v)$ is a subtree of the normalized binary tree $B(T)$ consisting of a non-null node $v$ and its two children.
Example:
   $BiB(\cdot) = (\{\cdot, \cdot, \cdot\}, \{(\cdot, \cdot)\}, \{(\cdot, \cdot)\})$
   $BiB(\cdot) = (\{\cdot, \varepsilon_1, \varepsilon_2\}, \{(\cdot, \varepsilon_1)\}, \{(\cdot, \varepsilon_2)\})$
[Figure: the two binary branches]
Although the two null nodes have identical labels ($\varepsilon$), they are different nodes. We emphasize this by showing their IDs in subscript.

Binary Branches of Trees and Datasets
Binary branches can be serialized as strings:
   $BiB(v) = (\{v, a, b\}, \{(v, a)\}, \{(v, b)\})$ is serialized as the concatenation $\lambda(v)\,\lambda(a)\,\lambda(b)$
   we can sort these strings ($\varepsilon > \lambda(v)$ for all non-null nodes $v$)
Binary branch sets:
   $BiB(T)$ is the set of all binary branches of $B(T)$
   $BiB(S) = \bigcup_{T \in S} BiB(T)$ is the set of all binary branches of dataset $S$
   $BiB_{sort}(S)$ is the vector of sorted serialized strings of $BiB(S)$
Note:
   nodes are unique in the tree, thus binary branches are unique
   labels are not unique, thus the serialized binary branches are not unique

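The serialization can be sketched directly on the original tree, since in $B(T)$ the left child of a node is its first child and the right child is its right sibling. The sketch below makes several assumptions that are not in the slides: a tree is a nested tuple `(label, [children])`, a serialized branch is a tuple of three labels rather than a concatenated string, and labels are ASCII strings, so that the character `"ε"` sorts after every label as required.

```python
EPS = "ε"  # label of a null node; sorts after all ASCII labels (assumption)


def binary_branches(tree):
    """Return the serialized binary branches (label, left, right) of a tree.

    A tree is a tuple (label, [children]); this layout is hypothetical.
    """
    branches = []

    def visit(node, next_sibling):
        label, children = node
        left = children[0][0] if children else EPS                   # first child
        right = next_sibling[0] if next_sibling is not None else EPS  # right sibling
        branches.append((label, left, right))
        for i, child in enumerate(children):
            visit(child, children[i + 1] if i + 1 < len(children) else None)

    visit(tree, None)
    return branches


t = ("a", [("b", [("d", [])]), ("c", [("d", [])])])
for branch in sorted(binary_branches(t)):
    print(branch)
```

In this example the two leaves labeled `d` are different nodes, but both serialize to `("d", "ε", "ε")`, which illustrates why serialized binary branches are not unique.
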
Example: Binary Branches of Trees and Datasets
[Figure: trees $T_1$ and $T_2$ with node IDs in subscript]
$BiB(\cdot_1) \neq BiB(\cdot_4)$:
   $BiB(\cdot_1) = (\{\cdot_1, \cdot_2, \cdot_3\}, \{(\cdot_1, \cdot_2)\}, \{(\cdot_1, \cdot_3)\})$
   $BiB(\cdot_4) = (\{\cdot_4, \cdot_5, \cdot_6\}, \{(\cdot_4, \cdot_5)\}, \{(\cdot_4, \cdot_6)\})$
The serialization of both, $BiB(\cdot_1)$ and $BiB(\cdot_4)$, is identical.
Sorted vector of serialized strings of $BiB(S)$, where $S = \{T_1, T_2\}$:
   $BiB_{sort}(S)$ = (·, ·, ·, ·, ·, ·, ·, ·, ·, ·)

Binary Branch Vector
The binary branch vector $BBV(T)$ is a representation of the binary branch set $BiB(T)$.
Construction of the binary branch vector $BBV(T)$:
   compute $BiB_{sort}(S)$ (serialize and sort $BiB(S)$)
   $b_i$ is the $i$-th serialized binary branch in sort order ($b_i = BiB_{sort}(S)[i]$)
   $BBV(T)[i]$ is the number of binary branches in $B(T)$ that serialize to $b_i$
Note: $BBV(T)[i]$ is zero if $b_i$ does not appear in $BiB(T)$.

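The construction can be sketched as follows, assuming each tree is already given as a list of serialized binary branches (plain strings over single-character ASCII labels, with `"ε"` for null nodes). The function name `branch_vectors` is hypothetical, and the sketch assumes that $BiB_{sort}(S)$ holds each distinct serialized string once.

```python
from collections import Counter


def branch_vectors(trees):
    """Build BiB_sort(S) and one binary branch vector per tree.

    trees: list of lists of serialized binary branches (strings).
    """
    # Distinct serialized branches of the whole dataset, in sort order.
    bib_sort = sorted({b for branches in trees for b in branches})
    vectors = []
    for branches in trees:
        counts = Counter(branches)                    # branch -> occurrences in this tree
        vectors.append([counts[b] for b in bib_sort])  # missing branches count as 0
    return bib_sort, vectors


t1 = ["abε", "bεc", "cdε", "dεε", "dεε"]
t2 = ["abε", "bεε"]
bib_sort, (bbv1, bbv2) = branch_vectors([t1, t2])
print(bib_sort)  # ['abε', 'bεc', 'bεε', 'cdε', 'dεε']
print(bbv1)      # [1, 1, 0, 1, 2]
print(bbv2)      # [1, 0, 1, 0, 0]
```

All vectors share the dimensions of $BiB_{sort}(S)$, so vectors of different trees in the same dataset can be compared position by position.
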
Example: Binary Branch Vectors
[Figure: trees $T_1$ and $T_2$]
$S = \{T_1, T_2\}$ is the data set
$BiB_{sort}(S)$ is the vector of sorted serialized strings of $BiB(S)$
$BBV(T_i)$ is the binary branch vector of $T_i$
The vector of serialized strings and the binary branch vectors are:

   i     BiB_sort(S)   BBV(T_1)   BBV(T_2)
   1     ·             1          1
   2     ·             1          0
   3     ·             0          1
   4     ·             1          0
   5     ·             0          1
   6     ·             2          2
   7     ·             0          1
   8     ·             0          1
   9     ·             2          0
   10    ·             1          2

Binary Branch Distance: Lower Bound for the Edit Distance

Binary Branch Distance [YKT05]
Definition (Binary Branch Distance)
Let $BBV(T) = (b_1, \ldots, b_k)$ and $BBV(T') = (b'_1, \ldots, b'_k)$ be binary branch vectors of trees $T$ and $T'$, respectively. The binary branch distance of $T$ and $T'$ is
   $\delta_B(T, T') = \sum_{i=1}^{k} |b_i - b'_i|$.
Intuition: We count the binary branches that do not match between the two trees.

Example: Binary Branch Distance
We compute the binary branch distance between $T_1$ and $T_2$:
[Figure: trees $T_1$ and $T_2$]

Example: Binary Branch Distance
The normalized binary tree representations are:
[Figure: $B(T_1)$ and $B(T_2)$]

Example: Binary Branch Distance
The binary branch vectors of $T_1$ and $T_2$ are:
   $BBV(T_1) = (1, 1, 0, 1, 0, 2, 0, 0, 2, 1)$
   $BBV(T_2) = (1, 0, 1, 0, 1, 2, 1, 1, 0, 2)$
The binary branch distance is
   $\delta_B(T_1, T_2) = \sum_{i=1}^{10} |b_{1,i} - b_{2,i}|$
   $= |1-1| + |1-0| + |0-1| + |1-0| + |0-1| + |2-2| + |0-1| + |0-1| + |2-0| + |1-2| = 9$,
where $b_{1,i}$ and $b_{2,i}$ are the $i$-th dimension of the vectors $BBV(T_1)$ and $BBV(T_2)$, respectively.

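The distance is the sum of absolute differences of two vectors of equal length. A minimal sketch that rechecks the example above; the function name `binary_branch_distance` is an illustrative choice.

```python
def binary_branch_distance(bbv_a, bbv_b):
    """Sum of absolute differences of two binary branch vectors of equal length."""
    if len(bbv_a) != len(bbv_b):
        raise ValueError("vectors must be built over the same BiB_sort(S)")
    return sum(abs(a - b) for a, b in zip(bbv_a, bbv_b))


# Binary branch vectors of T1 and T2 from the example.
bbv_t1 = [1, 1, 0, 1, 0, 2, 0, 0, 2, 1]
bbv_t2 = [1, 0, 1, 0, 1, 2, 1, 1, 0, 2]
print(binary_branch_distance(bbv_t1, bbv_t2))  # 9
```
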
Lower Bound Theorem
Theorem (Lower Bound)
Let $T$ and $T'$ be two trees. If the tree edit distance between $T$ and $T'$ is $\delta_t(T, T')$, then the binary branch distance between them satisfies
   $\delta_B(T, T') \le 5 \cdot \delta_t(T, T')$.
Proof (sketch; full proof in [YKT05]).
   Each node $v$ appears in at most two binary branches.
   Rename: Renaming a node causes at most two binary branches in each tree to mismatch. The sum is 4.
   A similar rationale holds for insert and its complementary operation delete (at most 5 binary branches mismatch).

Proof Sketch: Illustration for Rename
Transform $T_1$ to $T_2$: ren(·, x)
[Figure: trees $T_1$ and $T_2$ and the binary trees $B(T_1)$ and $B(T_2)$]
   Two binary branches exist only in $B(T_1)$
   Two binary branches exist only in $B(T_2)$
   $\delta_t(T_1, T_2) = 1$ (1 rename)
   $\delta_B(T_1, T_2) = 4$ (4 binary branches different)

Proof Sketch: Illustration for Insert
Transform $T_1$ to $T_2$: ins(x, ·, 2, 3)
[Figure: trees $T_1$ and $T_2$ and the binary trees $B(T_1)$ and $B(T_2)$]
   Two binary branches exist only in $B(T_1)$
   Three binary branches exist only in $B(T_2)$
   $\delta_t(T_1, T_2) = 1$ (1 insertion)
   $\delta_B(T_1, T_2) = 5$ (5 binary branches different)

Proof Sketch
In general it can be shown that
   Rename changes at most 4 binary branches
   Insert changes at most 5 binary branches
   Delete changes at most 5 binary branches
Each edit operation changes at most 5 binary branches, thus
   $\delta_B(T, T') \le 5 \cdot \delta_t(T, T')$.

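Rearranged, the bound reads $\delta_t(T, T') \ge \delta_B(T, T') / 5$, so the binary branch distance can rule out pairs of trees without computing the edit distance. The sketch below shows one way to use this. The function names and the threshold parameter `tau` are hypothetical, and rounding up assumes unit-cost edit operations, that is, an integer edit distance.

```python
def edit_distance_lower_bound(bbd: int) -> int:
    """Smallest integer edit distance compatible with delta_B <= 5 * delta_t.

    Assumes unit-cost edit operations, so the edit distance is an integer.
    """
    return (bbd + 4) // 5  # integer ceiling of bbd / 5


def can_prune(bbd: int, tau: int) -> bool:
    """True if the pair cannot be within edit distance tau."""
    return bbd > 5 * tau


print(edit_distance_lower_bound(9))  # 2
print(can_prune(9, 1))               # True: the edit distance is at least 2
print(can_prune(9, 2))               # False: the edit distance must still be computed
```

A pair that is not pruned is only a candidate; the bound gives no guarantee that its edit distance is actually within the threshold.
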
Binary Branch Distance: Complexity

Complexity: Binary Branch Distance
Compute the distance between two trees of size $O(n)$: ($S = \{T_1, T_2\}$, $n = \max\{|T_1|, |T_2|\}$)
Construction of the binary branch vectors $BBV(T_1)$ and $BBV(T_2)$:
   1. $BiB(S)$: compute the binary branches of $T_1$ and $T_2$: $O(n)$ time and space (traverse $T_1$ and $T_2$)
   2. $BiB_{sort}(S)$: sort the serialized binary branches of $BiB(S)$: $O(n \log n)$ time and $O(n)$ space
   3. construct $BBV(T_1)$ and $BBV(T_2)$:
      (a) traverse all binary branches: $O(n)$ time and space
      (b) for each binary branch find position $i$ in $BiB_{sort}(S)$: $O(n \log n)$ time (binary search in $BiB_{sort}(S)$ for $n$ binary branches)
      (c) $BBV(T)[i]$ is incremented: $O(1)$
Computing the distance:
   the two binary branch vectors are of size $O(n)$
   computing the distance has time complexity $O(n)$ (subtracting two binary branch vectors)
The overall complexity is $O(n \log n)$ time and $O(n)$ space.

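Steps 2 and 3 and the final subtraction can be sketched with a sort followed by binary searches. The sketch assumes step 1 is done, so each tree is given as a list of serialized binary branches (strings), and that $BiB_{sort}(S)$ holds each distinct string once. The names `build_bbvs` and `distance` are hypothetical.

```python
from bisect import bisect_left


def build_bbvs(branches_t1, branches_t2):
    """Sort-based construction of BBV(T1) and BBV(T2)."""
    # Step 2: sort the distinct serialized branches, O(n log n).
    bib_sort = sorted(set(branches_t1) | set(branches_t2))

    def bbv(branches):
        vec = [0] * len(bib_sort)
        for b in branches:                # step 3(a): traverse all branches
            i = bisect_left(bib_sort, b)  # step 3(b): binary search, O(log n)
            vec[i] += 1                   # step 3(c): increment, O(1)
        return vec

    return bbv(branches_t1), bbv(branches_t2)


def distance(v1, v2):
    """Subtract two binary branch vectors position by position, O(n)."""
    return sum(abs(a - b) for a, b in zip(v1, v2))


v1, v2 = build_bbvs(["abε", "bεc", "cεε"], ["abε", "bεε"])
print(v1, v2, distance(v1, v2))  # [1, 1, 0, 1] [1, 0, 1, 0] 3
```

Every searched branch is guaranteed to be in `bib_sort`, so `bisect_left` always returns the exact position of that branch.
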
Improving the Time Complexity with a Hash Function
Note: Improvement using a hash function:
   we assume a hash function that maps the $O(n)$ binary branches to $O(n)$ buckets without collision
   we do not sort $BiB(S)$
   position $i$ in the vector $BBV(T)$ is computed using the hash function
   $O(n)$ time (instead of $O(n \log n)$) and $O(n)$ space
In the following we assume the sort algorithm with $O(n \log n)$ runtime.

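A sketch of the hash-based variant, using a Python `dict` in place of the collision-free hash function assumed above; a `dict` gives expected rather than guaranteed constant-time lookups, so this is only an approximation of that assumption. Positions are assigned in order of first appearance. The name `bbvs_hashed` and the input layout (one list of serialized branch strings per tree) are hypothetical.

```python
def bbvs_hashed(trees):
    """Build binary branch vectors without sorting BiB(S)."""
    position = {}                    # serialized branch -> vector position
    for branches in trees:
        for b in branches:
            if b not in position:
                position[b] = len(position)
    vectors = []
    for branches in trees:
        vec = [0] * len(position)
        for b in branches:
            vec[position[b]] += 1    # position found by hashing, no binary search
        vectors.append(vec)
    return vectors


h1, h2 = bbvs_hashed([["abε", "bεc", "cεε"], ["abε", "bεε"]])
print(h1, h2)                                  # [1, 1, 1, 0] [1, 0, 0, 1]
print(sum(abs(a - b) for a, b in zip(h1, h2)))  # 3
```

The dimensions are no longer in sort order, but the distance is a sum over all positions and does not depend on their order, so it equals the value obtained with sorted vectors.
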
Complexity for Similarity Joins
Join two sets with $N$ trees each (tree size: $n$):
Compute Binary Branch Vectors (BBVs): $O(Nn \log(Nn))$ time, $O(N^2 n)$ space
   BBVs are of size $O(Nn)$
   time: sort $O(Nn)$ binary branches / $O(Nn)$ binary searches in BBVs
   space: $O(N)$ BBVs must be stored
Compute Distances: $O(N^3 n)$ time
   computing the distance between two trees has $O(Nn)$ time complexity (subtracting two binary branch vectors)
   $O(N^2)$ distance computations are required
Overall Complexity: $O(N^3 n + Nn \log n)$ time and $O(N^2 n)$ space,
   since $O(N^3 n + Nn \log(Nn)) = O(N^3 n + Nn \log N + Nn \log n) = O(N^3 n + Nn \log n)$

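The simplification of the overall time bound rests on one step that is easy to miss: the term $Nn \log N$ is dominated by $N^3 n$. Written out as a short derivation (a sketch of the reasoning, not taken from the slides):

```latex
\begin{align*}
Nn \log(Nn) &= Nn \log N + Nn \log n \\
Nn \log N &\le N^3 n \qquad \text{since } \log N \le N^2 \text{ for all } N \ge 1 \\
O\bigl(N^3 n + Nn \log(Nn)\bigr) &= O\bigl(N^3 n + Nn \log N + Nn \log n\bigr)
  = O\bigl(N^3 n + Nn \log n\bigr)
\end{align*}
```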
[YKT05] Rui Yang, Panos Kalnis, and Anthony K. H. Tung. Similarity evaluation on tree-structured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 754-765, Baltimore, Maryland, USA, June 2005. ACM Press.