A Unifying Framework for Merging and Evaluating XML Information

Similar documents
Roadmap. XML Indexing. DataGuide example. DataGuides. Strong DataGuides. Multiple DataGuides for same data. CPS Topics in Database Systems

Search sequence databases 3 10/25/2016

Higher order derivatives

The Equitable Dominating Graph

Application of Vague Soft Sets in students evaluation

CS 361 Meeting 12 10/3/18

Addition of angular momentum

Propositional Logic. Combinatorial Problem Solving (CPS) Albert Oliveras Enric Rodríguez-Carbonell. May 17, 2018

Fourier Transforms and the Wave Equation. Key Mathematics: More Fourier transform theory, especially as applied to solving the wave equation.

(Upside-Down o Direct Rotation) β - Numbers

1 Minimum Cut Problem

Addition of angular momentum

Derangements and Applications

cycle that does not cross any edges (including its own), then it has at least

Homework #3. 1 x. dx. It therefore follows that a sum of the

1 Isoparametric Concept

COMPUTER GENERATED HOLOGRAMS Optical Sciences 627 W.J. Dallas (Monday, April 04, 2005, 8:35 AM) PART I: CHAPTER TWO COMB MATH.

3 Finite Element Parametric Geometry

There is an arbitrary overall complex phase that could be added to A, but since this makes no difference we set it to zero and choose A real.

COUNTING TAMELY RAMIFIED EXTENSIONS OF LOCAL FIELDS UP TO ISOMORPHISM

Category Theory Approach to Fusion of Wavelet-Based Features

On spanning trees and cycles of multicolored point sets with few intersections

Elements of Statistical Thermodynamics

Construction of asymmetric orthogonal arrays of strength three via a replacement method

Searching Linked Lists. Perfect Skip List. Building a Skip List. Skip List Analysis (1) Assume the list is sorted, but is stored in a linked list.

Week 3: Connected Subgraphs

BINOMIAL COEFFICIENTS INVOLVING INFINITE POWERS OF PRIMES

Basic Polyhedral theory

The Matrix Exponential

Mutually Independent Hamiltonian Cycles of Pancake Networks

The van der Waals interaction 1 D. E. Soper 2 University of Oregon 20 April 2012

22/ Breakdown of the Born-Oppenheimer approximation. Selection rules for rotational-vibrational transitions. P, R branches.

Abstract Interpretation: concrete and abstract semantics

That is, we start with a general matrix: And end with a simpler matrix:

Properties of Phase Space Wavefunctions and Eigenvalue Equation of Momentum Dispersion Operator

Computing and Communications -- Network Coding

Background: We have discussed the PIB, HO, and the energy of the RR model. In this chapter, the H-atom, and atomic orbitals.

International Journal of Scientific & Engineering Research, Volume 6, Issue 7, July ISSN

Observer Bias and Reliability By Xunchi Pu

The Matrix Exponential

2.3 Matrix Formulation

Brief Introduction to Statistical Mechanics

Introduction to Arithmetic Geometry Fall 2013 Lecture #20 11/14/2013

GEOMETRICAL PHENOMENA IN THE PHYSICS OF SUBATOMIC PARTICLES. Eduard N. Klenov* Rostov-on-Don, Russia

ph People Grade Level: basic Duration: minutes Setting: classroom or field site

A Propagating Wave Packet Group Velocity Dispersion

EXST Regression Techniques Page 1

ME 321 Kinematics and Dynamics of Machines S. Lambert Winter 2002

Estimation of apparent fraction defective: A mathematical approach

What are those βs anyway? Understanding Design Matrix & Odds ratios

10. The Discrete-Time Fourier Transform (DTFT)

Middle East Technical University Department of Mechanical Engineering ME 413 Introduction to Finite Element Analysis

Enumerating Unlabeled and Root Labeled Trees for Causal Model Acquisition

First derivative analysis

Numerical considerations regarding the simulation of an aircraft in the approaching phase for landing

As the matrix of operator B is Hermitian so its eigenvalues must be real. It only remains to diagonalize the minor M 11 of matrix B.


CPSC 665 : An Algorithmist s Toolkit Lecture 4 : 21 Jan Linear Programming

CHAPTER 1. Introductory Concepts Elements of Vector Analysis Newton s Laws Units The basis of Newtonian Mechanics D Alembert s Principle

A central nucleus. Protons have a positive charge Electrons have a negative charge

Ch. 24 Molecular Reaction Dynamics 1. Collision Theory

Symmetric centrosymmetric matrix vector multiplication

EXPONENTIAL ENTROPY ON INTUITIONISTIC FUZZY SETS

BINOMIAL COEFFICIENTS INVOLVING INFINITE POWERS OF PRIMES. 1. Statement of results

Hydrogen Atom and One Electron Ions

Molecular Orbitals in Inorganic Chemistry

Text: WMM, Chapter 5. Sections , ,

u 3 = u 3 (x 1, x 2, x 3 )

ECE602 Exam 1 April 5, You must show ALL of your work for full credit.

Lecture 37 (Schrödinger Equation) Physics Spring 2018 Douglas Fields

Problem Set #2 Due: Friday April 20, 2018 at 5 PM.

AS 5850 Finite Element Analysis

Quasi-Classical States of the Simple Harmonic Oscillator

On the Hamiltonian of a Multi-Electron Atom

Section 6.1. Question: 2. Let H be a subgroup of a group G. Then H operates on G by left multiplication. Describe the orbits for this operation.

A Prey-Predator Model with an Alternative Food for the Predator, Harvesting of Both the Species and with A Gestation Period for Interaction

ANALYSIS IN THE FREQUENCY DOMAIN

SOME PARAMETERS ON EQUITABLE COLORING OF PRISM AND CIRCULANT GRAPH.

Abstract Interpretation. Lecture 5. Profs. Aiken, Barrett & Dill CS 357 Lecture 5 1

COHORT MBA. Exponential function. MATH review (part2) by Lucian Mitroiu. The LOG and EXP functions. Properties: e e. lim.

SCHUR S THEOREM REU SUMMER 2005

Data Assimilation 1. Alan O Neill National Centre for Earth Observation UK

Slide 1. Slide 2. Slide 3 DIGITAL SIGNAL PROCESSING CLASSIFICATION OF SIGNALS

Final Exam Solutions

Approximation and Inapproximation for The Influence Maximization Problem in Social Networks under Deterministic Linear Threshold Model

Molecular Orbital Diagrams for TM Complexes with

On the irreducibility of some polynomials in two variables

MATH 319, WEEK 15: The Fundamental Matrix, Non-Homogeneous Systems of Differential Equations

Title: Vibrational structure of electronic transition

Function Spaces. a x 3. (Letting x = 1 =)) a(0) + b + c (1) = 0. Row reducing the matrix. b 1. e 4 3. e 9. >: (x = 1 =)) a(0) + b + c (1) = 0

The pn junction: 2 Current vs Voltage (IV) characteristics

5.80 Small-Molecule Spectroscopy and Dynamics

Principles of Humidity Dalton s law

Recall that by Theorems 10.3 and 10.4 together provide us the estimate o(n2 ), S(q) q 9, q=1

Finding low cost TSP and 2-matching solutions using certain half integer subtour vertices

perm4 A cnt 0 for for if A i 1 A i cnt cnt 1 cnt i j. j k. k l. i k. j l. i l

Thus, because if either [G : H] or [H : K] is infinite, then [G : K] is infinite, then [G : K] = [G : H][H : K] for all infinite cases.

u x v x dx u x v x v x u x dx d u x v x u x v x dx u x v x dx Integration by Parts Formula

LINEAR DELAY DIFFERENTIAL EQUATION WITH A POSITIVE AND A NEGATIVE TERM

2F1120 Spektrala transformer för Media Solutions to Steiglitz, Chapter 1

Transcription:

A Unifying Framwork for Mrging and Evaluating XML Information HoLam Lau and Wilfrd Ng Dpartmnt of Computr Scinc, Th Hong Kong Univrsity of Scinc and Tchnology, Hong Kong {lauhl, wilfrd}@cs.ust.hk Abstract. With th vr incrasing connction btwn XML information systms ovr th Wb, usrs ar abl to obtain intgratd sourcs of XML information in a cooprativ mannr, such as dvloping an XML mdiator schma or using Xtnsibl Stylsht Languag Transformation (XSLT). Howvr, it is not trivial to valuat th quality of such mrgd XML data, vn whn w hav th knowldg of th involvd XML data sourcs. Hrin, w prsnt a unifying framwork for mrging XML data and study th quality issus of mrgd XML information. W captur th covrag of th objct sourcs as wll as th structural divrsity of XML data objcts, rspctivly, by th two mtrics of Information Compltnss (IC) and Data Complxity (DC) of th mrgd data. 1 Introduction Information intgration, a long stablishd fild in diffrnt disciplins of Computr Scinc such as cooprativ systms and mdiators, is rcognizd as an important databas subjct in a distributd nvironmnt [5, 8]. As th ntworking and mobil tchnologis advanc, th rlatd issus of information intgration bcom vn mor challnging, sinc mrgd data can b asily obtaind from a wid spctrum of mrging modrn data applications, such as mobil computing, prtopr transmission, mdiators, and data warhousing. As XML data mrgs as a dfacto standard of Wb information, w find it ssntial to addrss th quality issus of intgratd XML information. In this papr, w attmpt to stablish a natural and intuitiv framwork for assssing th quality of mrging XML data objcts in a cooprativ nvironmnt. W assum that thr ar many XML information sourcs which rturn thir own rlvant XML data objcts (or simply XML data trs) as a consqunc of sarching for a rquird ntity from th usrs. To gain th maximal possibl information from th sourcs, a usr should first qury th availabl sourcs and thn intgrat all th rturnd rsults. W do not study th tchniqus usd in th sarch and intgration procsss of th rquird XML data objcts as discussd in [1, 2, ]. Instad, w study th problm of how to justify th quality of mrgd XML information rturnd from th cooprativ sourcs. W propos a framwork to prform mrging and to analyz th mrgd information modlld as multipl XML data objcts rturnd from a st of XML L. Zhou, B.C. Ooi, and X. Mng (Eds.): DASFAA 2005, LNCS 45, pp. 81 94, 2005. c SpringrVrlag Brlin Hidlbrg 2005

82 H.L. Lau and W. Ng information sourcs. Essntially, our analysis is to convrt an XML data objct in an Mrgd Normal Form (MNF) and thn analyz th data contnt of th normalizd objct basd on a Mrgd Tr Pattrn (MTP). W dvlop th notions of Information Compltnss (IC) and Data Complxity (DC). Ths ar th two componnts rlatd to th masur of th information quality. Intuitivly, IC is dfind to comput th following two faturs rlatd to th compltnss of thos involvd information sourcs. First, how many XML data objcts (or quivalntly, XML objct trs) can b covrd by a data sourc, and scond, how much dtail dos ach XML data objct contain. W call th first fatur of IC th mrgd tr covrag (or simply th covrag) and th scond fatur of IC th mrgd tr dnsity (or simply th dnsity). Th motivation for us to dfin IC is that, in rality whn posing quris upon a st of XML information sourcs that hav littl ovrlaps in som prdfind st of cor labls C, thn th intgratd information contains a larg numbr of distinct XML data objcts but with fw subtrs or data valus undr th cor labls, in this cas th intgratd information has comparativly high covrag but low dnsity. On th othr hand, if th sourcs hav larg ovrlaps in C, th intgratd information contains a small numbr of distinct objcts with mor subtrs or data lmnts undr th cor labls, in this cas th intgratd information has comparativly low covrag but high dnsity. Th mtric DC is dfind to comput th following two faturs rlatd to th complxity of th rtrivd data itms, rsulting from mrging data from thos involvd information sourcs. First, how divrsifid th mrgd lmnts or th data undr a st of cor labls ar, and scond, how spcific thos mrgd lmnts or data ar. W call th first fatur of DC th mrgd tr divrsity (or simply th divrsity) and th scond fatur of DC th mrgd tr spcificity (or simply th spcificity). In rality, whn w mrg th data undr a labl in C it may lad to a too wid and dp tr structur. For xampl, if most data of th sam objct from diffrnt sourcs disagr with ach othr, thn w hav to mrg a divrs st of subtrs or data lmnts undr th labl. Furthrmor, th mrgd tr structur undr th labl can b vry dp, i.. to giv vry spcific information rlatd to th labl. W assum a global viw of data, which allows us to dfin a st of cor labls of an ntity that w sarch ovr th sourcs. As a cor labl may happn anywhr along a path of th tr corrsponding to th ntity instanc, w propos a Mrg Normal Form (MNF). Essntially, an XML objct in MNF nsurs that only th lowst cor labl along a path in th tr can contain intrstd subtrs or data lmnts. Assuming all XML objcts ar in MNF w aggrgat thm into a univrsal tmplat calld Mrgd Tr Pattrn (MTP). W prform mrging on th subtrs or data valus associatd with C from XML tr objcts: if th two corrsponding cor paths (paths having a cor labl) from diffrnt objcts ar qual, thn thy can b unanimously mrgd in MTP. If th two paths ar not qual, th conflict is rsolvd by changing th path to a gnral dscndant path. Finally, if th two cor paths do not xist thn thy ar said to b incomplt, th missing nod in MTP will b countd whn computing IC.

A Unifying Framwork for Mrging and Evaluating XML Information 8 Th main contributions is that w stablish a framwork for valuating th quality of intgratd XML information. Our approach is basd on a mrgd XML tmplat calld MTP, which is usd to aggrgat XML data objcts from diffrnt sourcs. Th framwork is dsirabl for svral rasons. First, th IC scor is a simpl but an ffctiv mtric to assss th quality of individual data sourc or a combination of data sourcs, which can srv as a basis for sourc slction optimization. Th DC is a natural mtric to assss th divrsity and spcificity of th subtrs undr cor labls. Scond, MTP shars th bnfits of traditional nstd rlations which ar abl to minimiz rdundancy of data. This allows a vry flxibl intrfac at th xtrnal lvl, sinc both flat and hirarchical data can b wll prsntd to th usrs. Th MTP provids for th xplicit rprsntation of th structur as wll as th smantics of objct instancs. Finally, an XML data objcts T can b convrtd into MNF in a linar tim complxity, O(k 1 + k 2 ), whr k 1 is th numbr of nods and k 2 is th numbr of dgs in T. Papr Organisation. Sction 2 formaliss th notion of intgration for a st of XML data objcts from a givn sourc, which includs th discussion of th mrgd objcts and th mrgd normal form (MNF). Sction introducs th concpt of XML mrgd tr pattrn (MTP) and illustrats how XML data objcts can b mrgd undr th MTP. Sction 4 dfins th componnts of masuring quality of intgratd XML information. Finally, w giv our concluding rmarks in Sction 5. 2 Mrging Data from XML Information Sourcs In this sction, w assum a simpl information modl consisting of diffrnt XML sourcs, which can b viwd as a st of objct trs. W introduc two notions of Mrg Normal Form (MNF) and Mrg Tr Pattrn (MTP) in ordr to valuat th mrgd rsults. Our assumptions of th information modl of cooprativ XML sourcs ar dscribd as follows. Cor Labl St. W assum a spcial labl st ovr th information sourcs, dnotd as C = {l 1,...,l n }. C is a st of cor tag labls (or simply cor labls) rlatd to th rqustd ntity. W trm thos paths starting from th ntity nod with tag labl l lading to a cor nod with tag labl th cor paths. C also consists of a uniqu ID labl, K, to idntify an XML objct instanc of. A usr qury q =, C is a slction of diffrnt information rlatd to a cor labl, which should includ th spcial K path. W assum htrognicity of data objcts to b rsolvd lswhr, such as using data wrapprs and mdiators. Ky Labl and Path. W assum an ntity constraint: if two sourcs prsnt an XML data objct thn w considr ths objcts rprsnt th sam ntity in ral world. Th K labl is important to mrg information of idntical objcts from diffrnt sourcs. W do not considr th gnral cas of FDs in ordr to simplify our discussion. Th assumption of th K tag labl is practical, sinc in

84 H.L. Lau and W. Ng rality an ID labl is commonly availabl in XML information, which is similar to th rlational stting. Sourc Rlationships. Th information sourc contnts ovrlap to various dgrs with ach othrs, rgarding th storag of XML data objcts. In an xtrm cas, on sourc can b qual to anothr sourc, for xampl mirror Wb sits. In th othr xtrm cas, on sourc can b disjoint from anothr, i.. no common XML data objct xists in two sourcs, for xampl on sourc holds ACM publications and anothr sourc holds IEEE publications. Usually, indpndnt sourcs hav diffrnt dgrs of ovrlaps,.g. thy shar information of common objcts. Furthrmor, if all objcts in on sourc xist in anothr largr on thn w say th formr is containd in th lattr. Exampl 1. Figur 1 shows thr publication objcts, and T, all of which hav a K path, ky, rprsntd in diffrnt XML objct trs as shown. Each objct contains diffrnt substs of th cor labls C = {ky, author, titl, url, yar}. C = {ky, author, titl, url, yar} r r r pub pub pub ky book titl yar ky articl author titl yar url ky author author titl yar GH24 501 SQL/Data Systm... 1981 XH2 90292 Phil Shaw Modification of Usr... 1990 db/systms/ sqlpaprs... RJ 276 Sai Strong Indx Path Lngth... 1980 T Fig. 1. A sourc of thr XML data objcts of publication rcords W now considr mrging th sam XML data objct idntifid by th K path. Thr ar svral scnarios arising from mrging an objct obtaind from two diffrnt sourcs. (1) A cor path l Cof th objct dos not xist in ithr sourcs. (2) A cor path l Cof th objct is providd by only on sourc. () A cor path l Cof th objct is providd by both sourcs but thir childrn undr th lnod may b distinct. Th first and scond cass do not impos any problms for mrging, sinc w simply nd to aggrgat th xistnt paths in th mrgd rsult. Th outcom of th mrg is that thr is ithr no information for th path l or a uniqu pic of information for th path l in th mrgd rsult. Th last cas dos not bring into any problm if th childrn (data valus or subtrs) undr th cor labl obtaind from th sourcs ar idntical. Howvr, it poss a problm whn thir childrn disagr with ach othr, sinc conflicting information happns in th mrgd rsult. Our approach is diffrnt from th common ons which adopt ithr human intrvntion or som prdfind rsolution schms to rsolv th conflicting data. W mak us of th flxibility of XML and introduc a spcial mrg nod lablld as m (mnod) as a parnt nod to mrg th two subtrs as its childrn. W formaliz th notion of mrging in th following Dfinition. Dfinition 1. (Mrging Subtrs Undr Cor Labls) Lt v 1 and v 2 b th roots of two subtrs, and, undr a cor labl l C.Ltm b th

A Unifying Framwork for Mrging and Evaluating XML Information 85 spcial labl for mrging subtrs. W construct a subtr T having th childrn gnratd by and, whr v is a mnod whos childrn ar dfind as follows. (1) T has two childrn of and undr th root v, if nithr v 1 nor v 2 ar mnods. (2) T has th childrn with bing addd immdiatly undr v,ifv 1 is undr a mnod but v 2 is not. (Similar for th cas if v 2 is th only mnod.) () T has th childrn child(v 1 )andchild(v 2 ) undr v, if both v 1 and v 2 ar undr mnods. A mrg oprator on two givn subtrs having th roots, v 1 and v 2, undr a givn l, dnotd as mrg(v 1,v 2 ), is an opration which rturns T as a child undr th lnod, dfind according to th abov conditions. Figur 2 shows th thr possibl rsults of mrg(v 1,v 2 ), on th two subtrs, and, undr th cor nod with labl l C. Th thr cass corrspond to th cass statd in Dfinition 1. W can s that th rsultant subtr T,which has th root of a mnod, is constructd from and......................... v 1 v 2 m v v 1 v 2 (v1, m ) child(v 1 ) v 2 child(v 1 ) (v1, m ) v v 2 (v 1, m ) child(v 1 ) child(v 2 ) (v 2, m) child(v 1 ) m v child(v 2 ) Cas 1 Cas 2 Cas Fig. 2. Th mrg oprator on two subtrs and undr a cor labl Th mrg oprator can b naturally xtndd to mor than two input childrn undr a givn cor nod with a labl l C. W can vrify th following commutativity and associativity proprtis of th mrg oprator: mrg(v 1,v 2 )= mrg(v 2,v 1 )andmrg(mrg(v 1,v 2 ),v )=mrg(v 1,mrg(v 2,v )). In addition, th mrg oprator is abl to prsrv th occurrncs of idntical data itms of a cor labl of th sam objct. Our us of th mrg nod has th bnfit that it provids th flxibility of furthr procssing of th childrn undr th mnod, which is indpndnt on any prdfind rsolution schms for conflicting data. For xampl, in th cas of having flat data valus undr th mnod, w may choos an aggrgat function such as min, max, sum or avg to furthr procss th conflicting rsults. In th cas of having tr data undr th mnod, w may us a tr pattrn to filtr away th unwantd spcific information. On might think that it is not sufficint to dfin th mrg oprator ovr th sam objct from diffrnt sourcs. In fact, th mrg oprator has also ignord th fact that in a cor path, mor than on cor labl may occur. In ordr to dal with ths complications, w nd th concpts of Mrg Normal Form (MNF) and Mrg Tr Pattrn (MTP) to handl gnral mrging of XML data objcts. Dfinition 2. (Mrg Normal Form) Lt T b an XML objct tr, whr P Cb th st of cor labls in T and K Pis th ky labl of T. Lt us call thos nods having a cor labl cor nods and thos path having a cor nod

86 H.L. Lau and W. Ng cor paths. AtrT is said to b in th Mrg Normal Form (MNF), dnotd as N(T ), if for any cor paths p in T, all th ancstor nods of th lowst cor nods of p hav on and only on child. Intuitivly, th MNF allows us to stimat how much information is associatd with th cor labls of an ntity by simply chcking th lowst cor labl in a path. W now prsnt an algorithm which convrts a givn XML objct tr, T, into an MNF. By Dfinition 2, w ar abl to viw N(T )={p 1,...,p n } as th mrgd normal form of T, whr p i is th cor path from th root to th lowst cor nod in th path. Not that any particular cor labl may hav mor than on cor path in N(T ). MNF Gnration Input: an XML objct tr T. Output: th MNF of th tr T r = N(T ) N(T ){ 1. Lt T r = φ; P = φ; 2. Normal(T.root);. rturn N(T ):=T r;} Normal(n) { 1. for ach child nod h i of n { 2. Normal(h i);. if (labl(n) C) { 4. if (n has noncor childrn) { 5. if (thr xist a path, path(n ) T r, such that path(n )=path(n)) 6. child(n )=mrg(child(n ),child(n)); 7. ls { 8. if (labl(n) / P) 9. P labl(n); 10. T r path(n); }}}}} Th undrlying ida of Algorithm 2 is to visit ach nod in T itrativly in a dpth first mannr until all distinct cor paths ar copid as sparat branchs into N(T ). Th cor paths that hav no noncor subtrs attachd ar rmovd. If thr ar two cor paths ndd at nods with th sam cor labl, w chck if thr xists a path, path(n ) N(T ), such that path(n )=path(n), thir valu ar simply mrgd togthr, othrwis, path(n) is addd as a nw branch. Th complxity of Algorithm 2 is O(k 1 + k 2 ), whr k 1 is th numbr of nod and k 2 is th numbr of dgs in T. Not that th NMF of T may not b a uniqu N(T ) from Algorithm 2. Howvr, it is asy to show that th output satisfis th rquirmnt in Dfinition 2. From now on, w assum that all XML data objct trs ar in MNF (or ls, thy can b transformd to MNF by using Algorithm 2.) W now xtnd th mrg oprations on two XML data objcts.

A Unifying Framwork for Mrging and Evaluating XML Information 87 Dfinition. (Mrging XML Data Objct from Two Sourcs) Lt cor(t ) b th st of cor labls in T. Givn two XML objct trs, = {p 1,...,p n } and = {q 1,...,q m }, whr p i and q j ar cor paths. W dfin T = mrg(, ) such that T satisfis (1) p i T, whr p i,p i /. (2) q j T, whr q j,q j /. () r k T and child(n rk )=mrg(n pi,n qj ), whr r k = p i = q j. Cas 1 U = O K C 1 C 2 C K C 4 C 5 C 6 K C 1 C 2 C C 4 C 5 C 6 Cas 2 UI T K C 1 C 2 C K C 1 C K C 1 C 2 C Cas U!= O T K C 4 C 5 C 4 C K C 4 C K C 5 C 1 C 2 C 1 C 1 C 2 C C T Fig.. Mrging of MNF XML trs N()andN() Figur shows th thr possibilitis of mrging an XML objct tr from two diffrnt sourcs. By Algorithm 2, w can transform an XML objct tr into its MNF, which can b viwd as a st of basic cor paths. W dnot S as a st of XML objct trs in MNF, S = {,,...,T n }. W furthr dvlop a tmplat for gnral mrging, calld th Mrg Tr Pattrn (MTP), which is usd to mrg th information of a givn st of normalizd objct trs obtaind from diffrnt sourcs. Essntially, w prform mrging th childrn of th basic cor paths itrativly within MTP. Dfinition 4. (Mrg Tr Pattrn) Lt cor(t ) dnot th st of basic cor labls in T.LtT = {,...,T n }. A Mrg Tr Pattrn (MTP) is a tr tmplat obtaind by combining th trs in T. An MTP is gnratd according to th following algorithm. W say that two basic cor paths ar mismatchd, if th two givn paths both nd at th sam cor labl but thy hav diffrnt lists of cor nods along th basic cor path. Th child of ach laf of th basic cor path in th MTP is a list of lmnts which stor th data corrsponding to th basic cor path. W also dfin dsc(n) to b th dscndant axis of th nod n. For xampl, givn path(n) = r/a/b/c/d, w hav dsc(n) = r//d. Exampl 2. Figur 4 dmonstrats th gnration of MTP with thr XML trs in MNF forms,, and T. W can chck that th cor path for titl ar diffrnt in, and T, thrfor, w rprsnt it as th dscndant axis r//titl. Th child of th cor labl author in is a subtr of noncor labls, in MTP, w insrt a lablld pointr (author, 1) to indicat it. Not that th child of th cor labl author subtr is a subtr having th root mnod. Th list in Algorithm 4 dos not stor th whol tr structur, w only nd to insrt a lablld pointr (m, 2) dirctd to th rquird subtr as shown.

88 H.L. Lau and W. Ng MTP Gnration Lt T i = {path(n 1i ),...,path(n pi )}, 1 i n and T = {path(m 1),...,path(m q)} Input: a st of trs in MNF, T = {,...,T n} Output: th MTP of T MTPGn(T ){ 1. Lt T = ; 2. For ach T i in T{ 4. For ach laf nod, n, int i { 5. if (path(n i)andpath(m j)armismatchd) { 6. path(m j)=dsc(m j); 7. list(m j).add(mrg(valu(m j),valu(n i))); } 8. ls {T path(n i); 9. list(n i).add(valu(n i)); }}} 10. rturn MTPGn(T ):=T; } pub pub pub firstnam lastnam author yar author ky book k yar y articl url ky titl m yar Phil Shaw titl firstnam lastnam titl GH24 SQL/Data 1981 XH2 Modification 1990 db/systms/ RJ Sai Strong Phil Shaw Indx Path 1980 501 Systm... 90292 of Usr... sqlpaprs... 276 Lngth... Sai Strong pub k author titl yar url y GH24 SQL/Data 1981 501 Systm... XH2 (author, 1) Modification db/systms/ 1990 90292 of Usr... sqlpaprs... RJ Indx Path (m, 2) 276 Lngth... 1980 T MTP of, and T Fig. 4. Gnration of MTP with thr XML trs in MNF Mrg Oprations on MTP In ordr to prform mrging of th ntir qury rsults from multipl sourcs, w dfin two usful mrg oprators on MTPs, th join mrg,, and th union mrg,. Dfinition 5. (Join Mrg Oprator) Lt P 1 and P 2 b two MTPs drivd from th sourcs S 1 and S 2.Ltcor(P 1 ), cor(p 1 ) Cb th st of cor labls obtaind from th sourcs S 1 and S 2, rspctivly, whr K cor(s 1 ), and K cor(s 2 ). Thn w dfin th MTP P = P 1 P 2, such that S 1 and S 2 satisfis that if r 1 //K = r 2 //K, thn T is constructd by: 1. path(r //K) :=path(r 1 //K). 2. path(r //l/v ):=path(r 1 //l/v 1 ), whr l (cor(p 1 ) cor(p 1 )).. path(r //l/v ):=path(r 2 //l/v 2 ), whr l (cor(p 1 ) cor(p 1 )). 4. path(r //l/v ):=dsc(v ) whr child(n v ):=mrg(child(n v1 ),child(n v2 )), whr l (cor(p 1 ) cor(p 2 )). Th intuition bhind th abov dfinition is that in th first condition w adopt th ky path as th only critrion to join th tr objcts idntifid by K obtaind from P 1 and P 2. Th scond and third conditions stat that w choos

A Unifying Framwork for Mrging and Evaluating XML Information 89 K C 1 C 2 C 1 (m, 1) (m, 2) 2 W 1 X 1 Y 1 Y 2 K C 1 C C 4 1 4 W 2 (m,) W (C, 4) (a) MTP 1 (b) MTP 2 Z 1 K 1 C 1 C 2 C C 4 (m, 1) (m, 2) (m,) X 1 (m, 4) Z 1 W 2 Y 2 C, 4 (c) MTP 1 MTP 2 Y 1 K 1 C 1 C 2 C C 4 (m, 1) (m, 2) (m,) 2 W 1 X 1 (m, 4) Z 1 4 W C W, 4 2 Y Y 1 2 (d) MTP 1 MTP 2 Fig. 5. Th Join Mrg and Join Union Oprators all th paths from th two MTPs as long as thy do not ovrlap. Th fourth condition is to rsolv th conflict of having th common path in both sourc MTPs by using a dscndant path and mrging th nod information. Exampl. Figur 5(c) and 5(d) illustrats th us of th joinmrg and joinunion oprations on P 1 and P 2. Dfinition 6. (Union Mrg Oprator) Lt P 1 and P 2 b two MTPs drivd from th sourcs S 1 and S 2.Ltcor(P 1 ), cor(p 1 ) Cb th st of cor labls obtaind from th sourcs S 1 and S 2, rspctivly, whr K cor(s 1 ), and K cor(s 2 ). Thn w dfin th MTP P = P 1 P 2, such that S 1 and S 2 satisfis that, if path(r 1 //K) =path(r 2 //K), thn T is constructd by: (1) path(r //K) :=path(r 1 //K). (2)path(r //l/v ):=path(r 1 //l/v 1 ), whr l (cor(p 1 ) cor(p 1 )). () path(r //l/v ) := path(r 2 //l/v 2 ), whr l (cor(p 1 ) cor(p 1 )). (4) path(r //l/v ) := dsc(v ) whr child(n v ) = mrg(child(n v1 ),child(n v2 )), whr l (cor(p 1 ) cor(p 2 )). Or ls T is constructd by path(r //l/v ):=path(r 1 //l/v 1 )orpath(r //l/v ):=path(r 2 //l/v 2 ). Notably, th union mrg oprator can b viwd as a gnralizd from of th fulloutr join in rlational databass [8]. 4 Quality Mtrics of Mrgd XML Trs W dscrib two masurs of information compltnss and data complxity to valuat th quality of th rsults of th join and union mrg oprators. Th mrgd tr covrag (or simply th covrag) of an XML sourc rlats to th numbr of objcts that th sourc can potntially rturn. Intuitivly, th notion of covrag capturs th prcntag of ral world information covrd in a sarch. Th problm lis in th fact that XML sourcs mutually ovrlap a diffrnt xtnt. W nd to dvis an ffctiv way to valuat th siz of covrag. Dfinition 7. (Mrgd Tr Covrag) Lt th MTP of th sourc S of a st of XML data objcts b P S and n b th total numbr of objcts rlatd to th rqustd ntity spcifid in a qury

90 H.L. Lau and W. Ng q =(, C), whr C is th st of cor labls associatd with. W dfin th mrgd tr covrag (or th covrag) ofp S with rspct to q as cov(s) = PS n, whr P S is th numbr of XML data objcts distinguishd by th objct ky K Cstord in P S. Th covrag scor of simpl objcts is btwn 0 and 1 and can b rgardd as th probability that any givn ral world objct is rprsntd by som objcts in th sourc. W adopt th union mrg oprator proposd in Dfinition 6 to gnrat th MTP for th mrgd objcts and dtrmin th covrag scor [4]. Exampl 4. Assum that thr ar about two million lctronic computr scinc publications ovr th Wb (i.. n = 2,000,000). About 490,000 of ths ar listd in th Digital Bibliography & Library Projct (DBLP) and th information is availabl in XML format. Tabl 1 shows th numbr of lctronic publications availabl on th Wb. Th covrag scors ar obtaind by dividing th numbr of publications by 2,000,000. Tabl 1. Th covrag scor of fiv lctronic publication sourcs Elctronic Publication Sourc CitSr Th Collction of Computr Scinc Bibliographis DBLP CompuScinc Computing Rsarch Rpository (CoRR) Numbr of Publication 659,481 1,46,418 490,000 412,06 75,000 Covrag Scors 0.297 0.717 0.2450 0.2061 0.075 Th covrag masur for th MTP from many sourcs can b computd in a similar way, basd on th covrag scors of individual sourcs. In rality, w may download th sourc to assss th covrag or th covrag can b stimatd by a domain xprt. To rspond to a usr qury, a qury is snt to multipl XML information sourcs. Th rsults rturnd by ths sourcs ar sts of rlvant XML data objcts. Som data objcts may b rturnd by mor than on sourc. W assum that thr ar only thr diffrnt cass of ovrlapping data sourcs. 1. Th two sourcs ar disjoint, which mans that, according to th K labl, thr ar no common XML data objcts in th two sourcs. Thn cov(s i S j ) is qual to cov(s i )+cov(s j )andcov(s i S j ) is qual to 0. 2. Th two sourcs ar ovrlapping, maning that, according to th K labl, thr ar som common XML data objcts in th two sourcs. Th two sourcs ar assumd to b indpndnt. Thn cov(s i S j ) is qual to cov(s i )+cov(s j ) cov(s i ) cov(s j )andcov(s i S j ) is qual to cov(s i ) cov(s j ) (if S i is containd in S j ).. On sourc is containd in anothr, which mans that, according to th K labl, all th XML data objcts in on sourc ar containd in anothr sourc. Thn cov(s i S j ) is qual to cov(s j )andcov(s i S j ) is qual to cov(s i ) (if S i is containd in S j ). Now, w considr th gnral cas of intgrating rsults rturnd from multipl data sourcs. W mphasiz that th xtnsion of th two mrg oprations in Dfinitions 5 and 6 from two sourcs to many sourcs is nontrivial,

A Unifying Framwork for Mrging and Evaluating XML Information 91 sinc mixd kinds of ovrlapping may occur btwn diffrnt sourcs. W lt M = (S 1,...,S n ) b th rsult obtaind from unionmrgd a st of sourcs W = {S 1,...,S n }.LtS W. W dfin th disjoint sts of sourcs D W to b th maximal subst of W, such that all th sourcs in D ar disjoint with S, th containd sts of sourcs T W to b th maximal subst of W, such that all th sourcs in T ar substs of S, and th indpndnt sts of sourcs I W to b th rmaining ovrlapping cass, i.. I = W T S. Thorm 1. Th following statmnts rgarding W and S ar tru. 1. cov(m S) =cov(m)+cov(s) cov(m S). 2. If S i W such that S S i, thn cov(m S) =cov(m)+cov(s) cov(m S), or ls cov(m S) =cov(s). Th mrgd tr dnsity (or simply th dnsity) of an XML sourc rlats to th ratio of cor labl information providd by th sourc. As XML objcts hav flxibl structurs, th rturnd objct trs from a sourc do not ncssarily hav information for all th cor labls. Furthrmor, a basic cor nod may hav a simpl data valu (i.. a laf valu) or a subtr as its child. W now dfin th dnsity of a cor labl in an MTP, P S. Dfinition 8. (Cor Labl Dnsity) W dfin th mrg tr dnsity (or simply dnsity) of a cor path p l for som l Cof P S, dnotd as dn(s, l), by dn(s, l) = {T PS rt //p l/v xists in T } P S, whr r T //p l /v is a basic cor path and v is th corrsponding cor nod in T. Th dnsity of th MTP P S, dnotd as dn(s), is th avrag dnsity ovr all cor labls and is givn by dn(s) = Σ l Cdn(S,l) C. In particular, a cor labl that has a child (a laf valu of a subtr) for vry data tr of th sourc S has a dnsity of 1 in its P S. Th dnsity of a cor labl l that is simply not providd by any objct data tr has dnsity dn(s, l) =0. Cor labls for which a sourc can provid som valus hav a dnsity scor in btwn 0 and 1. By assumption dn(s, K) is always 1. Tabl 2. Th DBLP XML tabl from MTP(DBLP) ky tr/ibm/gh24... tr/ibm/rj... tr/sql/xh2... tr/dc/src... tr/gt/tr026... titl SQL/Data Systm... Indx Path Lngth... Modification of Usr... Th 1995 SQL Runion... An Evaluation of Objct... author Sai, Strong Phil Shaw Frank Manola journal IBM Publication IBM Rsarch Rport ANSI XH2 Digital Systm... GTE Laboratoris... volum GH24501 RJ276 XH290292 SRC1997018 TR0260894165 yar 1981 1980 1990 1997 1994 url db/systms/... db/labs/gt/... Exampl 5. Lt C =(ky, titl, author, journal, vloumn, yar, url) b a simplifid st of cor labls of an articl objct. Considr th DBLP tabl (ignoring all cor paths) xtractd from MTP(DBLP) as shown in Tabl 2. Th data in th tabl is th information rturnd from th DBLP sourc for sarching articls. Th dnsity of th cor labls titl and url ar dn(s, titl) =1and dn(s, url) =0.4, rspctivly.

92 H.L. Lau and W. Ng Similar to finding ral covrag scors, dnsity scors can b assssd in many ways in practic. Information sourcs may giv th scors for an assssmnt. W may also us a sampling tchniqu to stimat th dnsity. For larg data sourcs, th sampling procss can b continuous and thn th scor can b incrmntally updatd to a mor accurat valu. Now, w considr th gnral cas of n data sourcs. W us th sam st of notations M,W,T,S and I as alrady introducd in Sction 4. Thorm 2. Th following statmnts rgarding W and S ar tru. 1. (dn(m S) =dn(m,l) cov(m)+dn(s, l) cov(s) dn(t,l) cov(t ) (dn(s, l)+dn(i,l) dn(s, l) dn(i,l)) cov(s) cov(i)+(dn(i,l)+dn(t,l) 1 dn(i,l) dn(t,l)) cov(i T )) cov(m S) 2. dn(m S, l) =dn(m,l)+dn(s, l) dn(m,l) dn(s, l). Exampl 6. Assum that DBLP(D) andcitsr(c) ar indpndnt sourcs. Lt th dnsity scors for th volumn labl b 1 and 0.6 rspctivly. Th covrag scor is 0.245 and 0.297. Thus, th dnsity scor of thir mrgd rsult is givn by dn(d C) =1 0.245 + 0.6 0.297 (1 + 0.6 1 0.6) 0.245 1 0.297 0.245+0.298 0.245 0.297 =0.287. W now add th CompuScinc(S) and assum it is indpndnt of DBLP and CitSr. Its dnsity of 0.8 for th volum labl and a covrag of 0.2061. Th nw dnsity scor is givn by: dn(d C S) =0.287 0.5747 + 0.8 0.2061 (0.287 + 0.8 0.287 0.8) 1 0.5747 0.2061 0.5747+0.2061 0.5747 0.2061 =0.4179. 4.1 Information Compltnss and Data Complxity Th notion of information compltnss of an information sourc rprsnts th ratio of its information amount to th total information of th ral world. Th mor complt a sourc is, th mor information it can potntially contribut to th ovrall rspons to a usr qury. Dfinition 9. (Information Compltnss) Th Information Compltnss (IC) of a sourc S is dfind by comp(s) = No. of data objcts associatd with ach l Cin P S, W C whr W is th total numbr of data objcts of a ral world ntity and P S is th MTP of S. Th following corollary allows us to mploy covrag and dnsity to find out th IC scor. This corollary can b trivially gnralisd to a st of information sourcs using th corrsponding MTP. Intuitivly, th notion of IC can b intrprtd as th rctangular ara formd by covrag (hight) and dnsity (width). Th following xampl furthr hlps to illustrat ths idas. Corollary 1. Lt S b an information sourc. Thn comp(s) = cov(s) dn(s).

A Unifying Framwork for Mrging and Evaluating XML Information 9 Exampl 7. Tabl 2 rprsnts th ntris DBLP XML sourc. Th tabl provids only fiv tupls with varying dnsity. Th covrag of th sourc is thus 5 givn by cov(dblp) = 2,000,000. Th dnsitis for th labls ar 1, 1, 0.6, 1, 1, 1, and 0.4, rspctivly, and it follows that th dnsity of th sourc is 6 7.Thus, 5 th compltnss of DBLP is 2,000,000 6 7 = 1,400,000. W dfin spcificity and divrsity of an XML sourc to rprsnt th dpth and bradth of data that th sourc can potntially rturn. As th subtr of a cor labl may contain subtrs of flxibl structurs, th rturnd data dos not contain th sam amount of data, it may contain somthing as simpl as a singl txtual valu, or a complx subtr having many lvls. Dfinition 10. (Spcificity and Divrsity) Lt P S b th MTP of a sourc S of a st of XML data objcts. Lt avg(d i )andmax(d i ) b th avrag and maximum dpth of child subtrs undr th cor nod lablld by l i Cand D b th maximum of {max(d 1 ),...,max(d n )} whr C = n. Spcificity is dfind by spc(s) = Σavg(di) Σavg(bi) n D. Similarly, w dfin divrsity by div(s) = n B, whr avg(b i )andmax(b i ) ar th avrag numbr and th maximum of childrn of subtrs undr th cor nod lablld by l i C. Similarly, B b th maximum of {max(b 1 ),...,max(b n )}. avrag dpth of subtr in r undr cor nods C C C (m, 2) (m, 1) r (m, ) avrag bradth of subtr in r undr cor nods ky ky C 1 C 2 C 1 2 1 2 1 0 ky C 1 C 2 C 1 8/ 1 0 1 (m, 1) Y 2 4 1 7/ d=5 d= d= d=1 d=2 1 d=4 d=1 d=2 2 11/2 1 4 b=8 b= b=5 b=2 b=1 2 (m, 2) Y 2 (m, ) b=7 b=2 b= 0 1 1 0 1 1 Y Z d=4 d=2 1 d=7/ b=11/2 b=8/ b=4 Fig. 6. Spcificity and divrsity of subtrs Exampl 8. In Figur 6, undr th cor labl C 1, th list contains thr valus: (m, 1), (m, 2),. Th dpth of cor labl C 1 is d(c 1 )= d((m,1))+d((m,2))+d( ) = 2+4+0 = 2. Similarly, d(c 2 )= 1+1+1 =1andd(C )= 0+ 7 +1 =1.1111. Th dpst path is (m, 2), so w hav D = 4. Th spcificity of th tr is spc(s) = d(c 1)+d(C 2)+d(C ) n D = 2+1+1.1111 4 = 0.426. Th bradth of cor labl C 1, b(c 1 )= b((m,1))+b((m,2))+b( ) = 8 + 11 2 +0 =2.7222. Similarly, b(c 2 )= 1+1+1 =1and b(c )= 0+4+1 = 5. Th broadst subtr is b((m, 2)), so w hav B =5.5. Th divrsity of th tr is div(s) = b(c1)+b(c2)+b(c) n B = 2.7222+1+1.6667 5.5 =0.266. Th notion of data complxity of an information sourc is mployd to rprsnt th amount of information from th sourc. Th highr th data complxity of a sourc is, th richr and broadr information it can potntially contribut to th ovrall rspons to a usr qury.

94 H.L. Lau and W. Ng Dfinition 11. (Data Complxity) Th Data Complxity (DC) of a sourc S is dfind by cpx(s) = spc(s) div(s). Exampl 9. Considr th MTP in Figur 6, th data complxity, DC of th tr is cpx(s) = spc(s) div(s) = 0.426 0.266 = 0.1119. 5 Concluding Rmarks W hav proposd a framwork which consists of two usful concpts, th first bing information compltnss (IC), which rprsnts th covrag of data objcts and th dnsity of data rlatd to a st of cor labls, and th scond bing data complxity (DC), which rprsnts th divrsity and th spcificity of data contnt for a st of cor nods associatd with th sarch ntity. Th framwork allows mrging XML data objcts obtaind from diffrnt sourcs. W prsnt an MNF as a standard format to unify th ssntial data in mrgd objcts of an ntity and an fficint algorithm to transform an XML data objct into an MNF. W dvlop MTP as a unifying tmplat to rprsnt th mrg of a st of XML objcts. MTP srvs as a basis to valuat th IC and DC scors. W also invstigatd th proprtis of dnsity and covrag via two mrg oprators in diffrnt sourcs that ar disjoint, ovrlapping and indpndnt. An important issu rlatd to this work is how w obtain th valus of various mtrics of covrag, dnsity, divrsity and spcificity. W suggst that this information can b drivd from th data sourcs or from othr authority corpora. For xampl, from th probability distribution on crtain topics of CS if w compar thos complt or almost complt sourcs w can thn comput th covrag. Rfrncs 1. E. Brtino and E. Frrari. XML and Data Intgration. IEEE Intrnt Computing. 5(6):7576, 2001. 2. V. Christophids, S. Clut, and J. Simon. On Wrapping Qury Languag and Efficint XML Intgration. In Proc.ofSIGMODConfrnc, 2000.. Z. G. Ivs t al. An Adaptiv Qury Excution Systm for Data Intgration. In Proc.ofSIGMOD, 1999. 4. D. Florscu, D. Kollr, and A. Lvy. Using probabilistic information in data intgration. In Proc.ofVLDB, 1997. 5. E. Lim, J. Srivastava, S. Prabhakar and J. Richardson. Entity Idntification in Databas Intgration. In Proc. of ICDE, 199. 6. A. Motro and I. Rakov. Estimating th quality of databass. In Proc. of FQAS, 1998. 7. F. Naumann, J. C. Frytag and U. Lsr. Compltnss of Information Sourcs. In Proc.ofDQCIS, 200. 8. Elmasri and Navath. Fundamntals of Databas Systms. Addison Wsly, rd Edition, 1997.