XReason: A Semantic Approach that Reasons with Patterns to Answer XML Keyword Queries

Cem Aksoy (1), Aggeliki Dimitriou (2), Dimitri Theodoratos (1), Xiaoying Wu (3)
(1) New Jersey Institute of Technology, Newark, NJ, USA
(2) National Technical University of Athens, Athens, Greece
(3) Wuhan University, Wuhan, China

Introduction
- Keyword search
  - A very popular technique
  - Easy for users, hard for systems!
  - Unstructured queries consisting of keywords
  - Tend to be ambiguous
- Our focus in this work: XML data
  - Popular export and exchange method
  - Tree-structured data

Example XML database
[Figure: XML tree of a bibliography. The root bib (node 1) has paper children; each paper has title and booktitle children, some have an author child (with affl and name children), and one has a cite child. Values shown include MIT, Miller, UIC, Burt, XML Integration, XML Design, XML Query, Data Integration, VLDB, SIGMOD and EDBT.]

Search on XML
- Languages such as XQuery and XPath can be used
- An example XQuery expression:
    for $x in doc("bibliography.xml")/bib/paper
    where $x/booktitle = "dasfaa"
    return $x/author
- But they have problems:
  - Complex language syntax
  - User needs to know the database schema

Keyword Search on XML
- Query: set of keywords
- Answer: subtrees of the XML tree (not whole documents)
- Minimum connecting trees (MCTs) are common as results
- Root of an MCT: Lowest Common Ancestor (LCA) of the keyword matches
- Large number of results, mostly irrelevant
- Ranking is important

XML Keyword Search Example
Q = {physics, james, harrison}
R1 = {(physics, 4), (james, 9), (harrison, 10)}
R2 = {(physics, 15), (james, 19), (harrison, 20)}
R3 = {(physics, 7), (james, 13), (harrison, 14)}
3 x 3 x 3 = 27 candidate results!
[Figure: example XML tree with 20 numbered nodes. Labels include title, prerequisite, textbook, author, fname and lname; the title values are Physics II, Statistical Physics, Physics I and Calculus.]
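The count of 27 follows from picking one matching node per keyword. A minimal sketch of that enumeration, assuming the match lists are exactly the node numbers that appear in R1, R2 and R3 (the function name `candidate_results` is illustrative, not from the talk):

```python
from itertools import product

# Keyword matches read off R1, R2 and R3: each keyword matches three nodes.
matches = {
    "physics": [4, 15, 7],
    "james": [9, 19, 13],
    "harrison": [10, 20, 14],
}


def candidate_results(matches):
    """Return every combination that picks one matching node per keyword."""
    keywords = list(matches)
    return [
        dict(zip(keywords, combo))
        for combo in product(*(matches[k] for k in keywords))
    ]


candidates = candidate_results(matches)
print(len(candidates))  # 27
```

R1, R2 and R3 are three of these 27 combinations; the remaining ones mix matches from different parts of the tree.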

XML Keyword Search Example: Smallest LCA (SLCA)
Q = {physics, james, harrison}
[Figure: the example tree with the LCAs of R1, R2 and R3 marked; each LCA is labeled either as an SLCA or as not an SLCA.]
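Under the usual definition, an SLCA is an LCA of keyword matches that has no descendant which is also such an LCA. The sketch below applies that filter to a set of LCAs. It assumes each node is identified by its root-to-node path of child positions; the paths in the example are hypothetical and do not correspond to the node numbers in the figure.

```python
def is_proper_ancestor(a, b):
    """True if path a is a proper prefix of path b, i.e. a is an ancestor of b."""
    return len(a) < len(b) and b[:len(a)] == a


def smallest_lcas(lcas):
    """Keep only the LCAs that have no other LCA among their descendants."""
    lcas = set(lcas)
    return {
        node for node in lcas
        if not any(is_proper_ancestor(node, other) for other in lcas)
    }


# Hypothetical LCAs: the root, two of its children, and a grandchild.
lcas = {(1,), (1, 1), (1, 2), (1, 2, 3)}
slcas = smallest_lcas(lcas)
print(sorted(slcas))  # [(1, 1), (1, 2, 3)]
```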

XML Keyword Search Example: Exclusive LCA (ELCA)
Q = {physics, james, harrison}
[Figure: the example tree with the LCAs of R1, R2 and R3 marked; each is labeled as an LCA, an SLCA or an ELCA.]

XML Keyword Search Example
Q = {physics, james, harrison}
- Reasonable but ad hoc!
- Decisions are based on locality; patterns are not considered
[Figure: the example tree with the LCA, SLCA and ELCA nodes of R1, R2 and R3 marked.]

Contributions
- Novel keyword search semantics on XML that reasons with keyword query patterns
- Reasoning is based on homomorphisms between patterns
- Ranking and filtering semantics are based on a graph of patterns
- A stack-based algorithm for generating patterns
- Promising effectiveness and efficiency experimental results on real and benchmark datasets

Definitions: Data Model
- XML data is modeled as an ordered, node-labeled tree
- Nodes are encoded with Dewey codes
[Figure: the bibliography tree with Dewey codes. bib is 1; the two paper nodes are 1.1 and 1.2; under each paper, author is x.1, title is x.2 and booktitle is x.3; under each author, affl is x.1.1 and name is x.1.2. Paper 1.1: MIT, Miller, XML Integration, VLDB. Paper 1.2: UIC, Burt, XML Design, SIGMOD.]
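One practical consequence of Dewey encoding is that the LCA of a set of nodes is the longest common prefix of their codes. A minimal sketch, using codes from the bibliography tree on this slide (the helper names are illustrative):

```python
def parse_dewey(code):
    """Turn '1.2.1' into the tuple (1, 2, 1)."""
    return tuple(int(part) for part in code.split("."))


def lca(codes):
    """Return the Dewey code of the lowest common ancestor of the given nodes."""
    paths = [parse_dewey(code) for code in codes]
    prefix = []
    for parts in zip(*paths):
        if len(set(parts)) != 1:  # the paths diverge at this depth
            break
        prefix.append(parts[0])
    return ".".join(str(p) for p in prefix)


# name (1.1.1.2) and title (1.1.2) belong to the same paper.
print(lca(["1.1.1.2", "1.1.2"]))  # 1.1
# name (1.1.1.2) and booktitle (1.2.3) belong to different papers.
print(lca(["1.1.1.2", "1.2.3"]))  # 1
```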

Definitions
- An answer to a keyword query on XML data is a set of instance trees (ITs)
- IT: the minimum subtree that is rooted at the root of the XML tree and contains a matching for the query keywords
[Figure: (a) an IT I with nodes 1, 2, 4 (title), 5, 9 (fname) and 10 (lname); (b) its MCT M with nodes 2, 4 (title), 5, 9 (fname) and 10 (lname). Annotations are written ann(n).]
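The difference between an IT and its MCT can be made concrete with Dewey codes: the IT is the union of the root-to-match paths, and the MCT is the part of the IT at or below the LCA of the matches. The sketch below assumes nodes are given as Dewey code strings and that there is at least one match; the function names are illustrative.

```python
def ancestors_or_self(code):
    """All nodes on the path from the root to the node, as Dewey codes."""
    parts = code.split(".")
    return {".".join(parts[:i]) for i in range(1, len(parts) + 1)}


def instance_tree(matches):
    """Nodes of the IT: the union of the root-to-match paths."""
    nodes = set()
    for code in matches:
        nodes |= ancestors_or_self(code)
    return nodes


def mct(matches):
    """Nodes of the MCT: the IT restricted to the subtree of the LCA."""
    # The LCA is the deepest node that lies on every root-to-match path.
    common = set.intersection(*(ancestors_or_self(c) for c in matches))
    root = max(common, key=lambda c: c.count("."))
    return {
        n for n in instance_tree(matches)
        if n == root or n.startswith(root + ".")
    }


matches = ["1.1.1.2", "1.1.2"]  # a name node and a title node
print(sorted(instance_tree(matches)))  # ['1', '1.1', '1.1.1', '1.1.1.2', '1.1.2']
print(sorted(mct(matches)))            # ['1.1', '1.1.1', '1.1.1.2', '1.1.2']
```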

Definitions
- Ranking semantics: all the result ITs are ranked
- Filtering semantics: a subset of the result ITs is returned as an answer

IT Patterns
- A pattern represents the set of ITs that share the same structure, including the labels and annotations
Q = {physics, james, harrison}
[Figure: XML tree with 24 numbered nodes rooted at university (node 1). Labels include events, year, seminars, seminar, topic, speaker, title, prerequisite, fname and lname; values include 2012, Physics II, Statistical Physics, Physics I, Quantum Physics, James Harrison, Smith, John, George and Miller.]

Example on Patterns
Q = {physics, james, harrison}
[Figure: the university tree together with eight patterns of the query, P1 to P8. The patterns are rooted at university and combine the labels events, seminars, seminar, prerequisites, title, fname, lname, topic and speaker; some speaker nodes are annotated with [James, Harrison].]

Example on Patterns (2)
Q = {physics, james, harrison}
[Figure: the 20-node example tree with titles Physics II, Statistical Physics, Physics I and Calculus.]

IT Patterns
Q = {physics, james, harrison}
[Figure: the 20-node example tree with two ITs highlighted, IT1 (matches at nodes 4, 9 and 10) and IT3 (matches at nodes 7, 13 and 14), next to the single pattern they share, which contains title, fname and lname nodes.]
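Grouping ITs by pattern amounts to dropping the node identifiers and comparing what is left: labels, annotations and shape. The following is a minimal sketch under two assumptions that are not stated on the slides: an IT is given as nested `(node_id, label, keywords, children)` tuples, and sibling order is ignored when comparing patterns. The labels `root`, `course`, `instructor`, and the structure of `it2`, are hypothetical; only the node numbers and the title, fname, lname, author, prerequisite and textbook labels come from the example.

```python
from collections import defaultdict


def pattern_of(node):
    """Canonical form of an IT: labels and annotations without node ids."""
    _node_id, label, keywords, children = node
    return (
        label,
        tuple(sorted(keywords)),
        tuple(sorted(pattern_of(child) for child in children)),
    )


def group_by_pattern(its):
    """Map each distinct pattern to the names of the ITs that share it."""
    groups = defaultdict(list)
    for name, tree in its.items():
        groups[pattern_of(tree)].append(name)
    return dict(groups)


def leaf(node_id, label, *keywords):
    return (node_id, label, list(keywords), [])


it1 = (1, "root", [], [(2, "course", [], [
    leaf(4, "title", "physics"),
    (5, "instructor", [], [leaf(9, "fname", "james"), leaf(10, "lname", "harrison")]),
])])
it3 = (1, "root", [], [(3, "course", [], [
    leaf(7, "title", "physics"),
    (8, "instructor", [], [leaf(13, "fname", "james"), leaf(14, "lname", "harrison")]),
])])
it2 = (1, "root", [], [(2, "course", [], [(6, "prerequisite", [], [
    leaf(15, "title", "physics"),
    (16, "textbook", [], [leaf(19, "author", "james"), leaf(20, "author", "harrison")]),
])])])

groups = group_by_pattern({"IT1": it1, "IT2": it2, "IT3": it3})
print(sorted(groups.values()))  # [['IT1', 'IT3'], ['IT2']]
```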

Semantics
- Direct comparison between patterns (instead of assigning scores to patterns)
- Comparison is based on different kinds of relations defined on patterns
- These relations are defined using different types of homomorphisms between patterns

Pattern Homomorphism
- A mapping from P to P'
- P' can be obtained from P by merging paths and unioning annotations
[Figure: three patterns rooted at seminars, each with a year node annotated [2012], linked by homomorphisms from P to P' and from P' to P''. P has two seminar nodes, each with one speaker; P' has one seminar node with two speaker nodes; P'' has one seminar node with one speaker annotated [James, Harrison].]
- Intuitively, P' is more relevant than P, and P'' is more relevant than P' (compactness)

Pattern Homomorphism (2)
- Pattern homomorphism does not always work
[Figure: two patterns P and P' with title, fname and lname nodes; in one of them the title node lies under a prerequisites node.]

Path Homomorphism
- Maps separately every root-to-annotated-node path of P to a path in P'
[Figure: the two patterns P and P' with title, fname and lname nodes, one of them with the title node under a prerequisites node; the paths of P are mapped to P' one at a time.]
- P' is more relevant than P
- Intuition: keyword instances in P' are connected more tightly than in P
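A path homomorphism check can be sketched as follows. The exact definition is not spelled out on the slides, so this assumes that a homomorphism maps root to root, preserves labels and parent-child edges, and sends each node to a node whose annotation contains its own. Patterns are nested `(label, keywords, children)` tuples; all example patterns and annotations are hypothetical, loosely modeled on the seminars example.

```python
def root_paths(pattern, prefix=()):
    """Yield each root-to-annotated-node path as a tuple of (label, keywords) steps."""
    label, keywords, children = pattern
    path = prefix + ((label, frozenset(keywords)),)
    if keywords:
        yield path
    for child in children:
        yield from root_paths(child, path)


def maps_into(path, pattern):
    """True if the path can be mapped, step by step, onto a root path of pattern."""
    (label, keywords), rest = path[0], path[1:]
    t_label, t_keywords, t_children = pattern
    if label != t_label or not keywords <= frozenset(t_keywords):
        return False
    return not rest or any(maps_into(rest, child) for child in t_children)


def has_path_homomorphism(p, q):
    """True if every root-to-annotated-node path of p maps to a path of q."""
    return all(maps_into(path, q) for path in root_paths(p))


two_seminars = ("seminars", [], [
    ("seminar", [], [("speaker", ["james"], [])]),
    ("seminar", [], [("speaker", ["harrison"], [])]),
])
one_seminar = ("seminars", [], [
    ("seminar", [], [("speaker", ["james", "harrison"], [])]),
])
print(has_path_homomorphism(two_seminars, one_seminar))  # True
print(has_path_homomorphism(one_seminar, two_seminars))  # False

# Each path is mapped on its own, so the two paths of `joined` may land on
# different `a` nodes of `split`; a mapping of the whole pattern would need
# a single image for `a` that has both children.
joined = ("r", [], [("a", [], [("b", ["k1"], []), ("c", ["k2"], [])])])
split = ("r", [], [("a", [], [("b", ["k1"], [])]), ("a", [], [("c", ["k2"], [])])])
print(has_path_homomorphism(joined, split))  # True
```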

All-Path-Homomorphism (aph) Relation
- We use the homomorphisms introduced earlier to define a relation on patterns
- P and P': two patterns; M and M': the corresponding MCTs
- P ≺aph P' if
  - there is a homomorphism from M to M' and none from M' to M, or
  - there is a path homomorphism from M to M' and none from M' to M
[Figure: the title, fname and lname patterns P and P', with P ≺aph P'.]
[Figure: the three seminars patterns P, P' and P'', with P ≺aph P' and P' ≺aph P''.]

All-Path-Homomorphism (aph) Relation (2)
- The aph relation cannot help us in all cases
- For instance, when heterogeneous data is merged into one XML document
[Figure: patterns P4 and P5, both rooted at university, with events, seminars and seminar nodes; labels shown include topic, speaker (annotated [James, Harrison]), title and lname.]

Partial-Path-Homomorphism (pph) Relation
- P ≺pph P' if
  - there is a mapping of a root-to-annotated-node path
  - the MCT root of P is mapped to a descendant of the MCT root of P' (the destination MCT root)
- Example: P4 ≺pph P5
[Figure: patterns P4 and P5, both rooted at university, with their LCAs (MCT roots) marked; labels shown include events, seminars, seminar, topic, speaker (annotated [James, Harrison]), title and lname.]

XReason Semantics
- ≺ (Precedence) relation: P ≺ P' if P ≺aph P' or P ≺pph P'
- Precedence graph G≺: Vertex = Pattern, Edge = P ≺ P'
[Figure: precedence graph over the patterns P1 to P15, with edges labeled by the relation (aph or pph) that produces them.]

XReason Ranking Semantics
We define order O based on
(a) Ascending GLevel
[Figure: the precedence graph arranged by level. GLevel=1: P2, P4. GLevel=2: P1. GLevel=3: P3, P7. GLevel=4: P5, P8, P9, P10, P12, P13, P14, P15. GLevel=5: P6, P11.]
Pattern order O
1. P2 P4
2. P1
3. P3 P7
4. P5 P8 P9 P10 P12 P13 P15 P14
5. P11 P6

XReason Ranking Semantics
We define order O based on
(a) Ascending GLevel
(b) Descending MCTDepth
[Figure: patterns P2 (MCTDepth=3) and P4 (MCTDepth=4), both rooted at university.]
Pattern order O
1. P4
2. P2
3. P1
4. P3
5. P7
6. P5 P8 P9 P10 P12 P13 P14 P15
7. P11 P6

XReason Ranking Semantics
We define order O based on
(a) Ascending GLevel
(b) Descending MCTDepth
(c) Ascending MCTSize
[Figure: patterns P5 (MCTSize=9) and P8 (MCTSize=7), both rooted at university.]
Pattern order O
1. P4
2. P2
3. P1
4. P3
5. P7
6. P8
7. P5 P9 P10 P12
8. P13
9. P15 P14
10. P11 P6

XReason Ranking Semantics
We define order O based on
(a) Ascending GLevel
(b) Descending MCTDepth
(c) Ascending MCTSize
- XReason ranks ITs in an order which complies with the order O of their patterns
- Might create equivalence classes
Pattern order O
1. P4
2. P2
3. P1
4. P3
5. P7
6. P8
7. P5 P9 P10 P12
8. P13
9. P15 P14
10. P11 P6
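The three criteria combine naturally into one lexicographic sort key, with ties forming the equivalence classes. A minimal sketch: the GLevel values, the MCTDepth of P2 and P4, and the MCTSize of P5 and P8 are taken from the slides, while every other number below is a made-up placeholder chosen only so that the example runs.

```python
from itertools import groupby

# name: (GLevel, MCTDepth, MCTSize)
patterns = {
    "P1": (2, 3, 6),  # depth and size are placeholders
    "P2": (1, 3, 6),  # size is a placeholder
    "P4": (1, 4, 8),  # size is a placeholder
    "P5": (4, 5, 9),  # depth is a placeholder
    "P8": (4, 5, 7),  # depth is a placeholder
    "P9": (4, 5, 9),  # depth and size are placeholders (tie with P5)
}


def sort_key(name):
    glevel, depth, size = patterns[name]
    # Ascending GLevel, descending MCTDepth, ascending MCTSize.
    return (glevel, -depth, size)


def rank(patterns):
    """Return the patterns as an ordered list of equivalence classes."""
    ordered = sorted(patterns, key=sort_key)
    return [sorted(group) for _, group in groupby(ordered, key=sort_key)]


order = rank(patterns)
print(order)  # [['P4'], ['P2'], ['P1'], ['P8'], ['P5', 'P9']]
```

Patterns with an identical key, such as P5 and P9 here, end up in the same class, and their ITs share a rank.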

XReason Filtering Semantics
- Answer: ITs whose patterns are source nodes in G≺
- In the example: the ITs of P2 and P4
[Figure: the precedence graph over P1 to P15, with the source nodes P2 and P4 highlighted.]
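Finding the source nodes of the precedence graph is a matter of collecting the vertices without incoming edges. The sketch below assumes an edge `(src, dst)` points from a pattern to one that it precedes; the edge list is hypothetical and only mirrors the fact that P2 and P4 are the sources in the example.

```python
def source_patterns(vertices, edges):
    """Return the vertices that have no incoming edge, in sorted order."""
    targets = {dst for _src, dst in edges}
    return sorted(v for v in vertices if v not in targets)


vertices = ["P1", "P2", "P3", "P4", "P7"]
# Hypothetical edges; the real graph is only available as a figure.
edges = [("P2", "P1"), ("P4", "P1"), ("P1", "P3"), ("P1", "P7")]

sources = source_patterns(vertices, edges)
print(sources)  # ['P2', 'P4']
```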

Algorithm PatternStack
- Input: inverted lists of the query keywords
- Output: patterns of the query with the associated ITs
- Reads the inverted lists in document order
- Constructs the patterns on the fly, incrementally
- Links the constructed ITs with their respective patterns
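The first step, reading several inverted lists in document order, can be sketched with a k-way merge. This is only the reading step and not PatternStack itself. It assumes each inverted list holds Dewey codes already sorted in document order, and relies on the fact that comparing Dewey codes as integer tuples yields document order. The keyword lists are read off the bibliography tree shown earlier.

```python
import heapq


def parse_dewey(code):
    """Turn '1.2.1' into the tuple (1, 2, 1)."""
    return tuple(int(part) for part in code.split("."))


def document_order(inverted_lists):
    """Merge per-keyword lists into one stream of (code, keyword) pairs."""
    streams = (
        [(parse_dewey(code), keyword) for code in codes]
        for keyword, codes in inverted_lists.items()
    )
    return [
        (".".join(str(part) for part in dewey), keyword)
        for dewey, keyword in heapq.merge(*streams)
    ]


inverted_lists = {
    "xml": ["1.1.2", "1.2.2"],
    "burt": ["1.2.1.2"],
    "vldb": ["1.1.3"],
}
stream = document_order(inverted_lists)
print(stream)
# [('1.1.2', 'xml'), ('1.1.3', 'vldb'), ('1.2.1.2', 'burt'), ('1.2.2', 'xml')]
```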

Experimental Setup
- Datasets
  - Effectiveness: Mondial (1.7 MB), Sigmod (467 KB), EBAY (34 KB)
  - Efficiency: NASA (23 MB), XMark (150 MB)
- Metrics for the filtering experiments
  - Precision: ratio of the relevant results in the system's result set to the size of the result set
  - Recall: ratio of the relevant results in the result set to all relevant results
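The two filtering metrics can be written down directly from their definitions. A minimal sketch; the result identifiers are made up for illustration.

```python
def precision(returned, relevant):
    """Fraction of the returned results that are relevant."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(returned) if returned else 0.0


def recall(returned, relevant):
    """Fraction of the relevant results that were returned."""
    returned, relevant = set(returned), set(relevant)
    return len(returned & relevant) / len(relevant) if relevant else 0.0


returned = ["it1", "it2", "it3", "it4"]
relevant = ["it1", "it2", "it5"]
print(precision(returned, relevant))  # 0.5
print(recall(returned, relevant))     # two of the three relevant results
```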

Experimental Setup
- Metrics for the ranking experiments
  - Mean Average Precision (MAP): the mean, over the queries, of the average of the precision scores measured after each relevant result of a query is retrieved
  - Reciprocal Rank (R-Rank): reciprocal of the rank of the first correct result of a query
  - Precision at N (P@N): precision of the ranked list when it is cut off at rank N
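The ranking metrics follow their textbook definitions. The sketch below assumes a ranked list without ties and uses made-up result identifiers. Average precision is divided by the number of relevant results, which is the common convention and an assumption about how the talk's experiments were scored.

```python
def average_precision(ranked, relevant):
    """Average of the precision values measured at each relevant result."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def precision_at(n, ranked, relevant):
    """Precision of the list cut off at rank n."""
    return sum(1 for item in ranked[:n] if item in relevant) / n


ranked, relevant = ["a", "x", "b", "y"], {"a", "b"}
ap = average_precision(ranked, relevant)   # (1/1 + 2/3) / 2
rr = reciprocal_rank(["x", "a"], {"a"})    # 0.5
p_at_2 = precision_at(2, ranked, relevant) # 0.5
```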

Experimental Setup
- Handling equivalence classes (ITs that share the same rank):
  - Best: all correct results are assumed to be ranked at the beginning of the equivalence class
  - Worst: all correct results are assumed to be ranked at the end of the equivalence class
- The best and worst versions give upper and lower bounds for the metrics, respectively
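The best and worst versions can be obtained by flattening the equivalence classes into a strict ranking in two ways. A minimal sketch with made-up result identifiers, using reciprocal rank as the metric:

```python
def flatten(classes, relevant, best):
    """Turn equivalence classes into a strict ranking.

    With best=True the relevant results of each class come first,
    with best=False they come last. Python's sort is stable, so the
    original order is kept otherwise.
    """
    ranked = []
    for cls in classes:
        ranked += sorted(cls, key=lambda item: (item in relevant) != best)
    return ranked


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


classes = [["a", "b"], ["c", "d", "e"]]  # two equivalence classes
relevant = {"d"}

best = flatten(classes, relevant, best=True)
worst = flatten(classes, relevant, best=False)
print(best, reciprocal_rank(best, relevant))    # d is ranked third
print(worst, reciprocal_rank(worst, relevant))  # d is ranked fifth
```

The true score of any tie-breaking within the classes lies between the two values.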

Filtering Experiments
- XReason is compared with:
  - SLCA: ITs whose MCT root is an SLCA
  - ELCA: ITs whose MCT root is an ELCA
  - ITReal: an adaptation of XReal
    - XReal finds an intended node type
    - ITReal returns ITs whose MCT root is a descendant of XReal's intended node type
- 6 queries over each dataset (Mondial, SIGMOD, EBAY)

Ranking Results

Dataset   Semantics   MAP worst   MAP best   R-Rank worst   R-Rank best
Mondial   XReason     0.95        0.95       1.00           1.00
Mondial   ITReal      0.60        0.87       0.59           1.00
Sigmod    XReason     1.00        1.00       1.00           1.00
Sigmod    ITReal      0.19        0.69       0.26           0.83
EBAY      XReason     1.00        1.00       1.00           1.00
EBAY      ITReal      0.60        0.80       0.53           1.00

- XReason has perfect R-Rank scores
- XReason outperforms ITReal w.r.t. MAP

Ranking Results
- Worst and best versions are superimposed
- XReason has better P@10 values than ITReal

Filtering Results
[Figure: per-query filtering results on a 0.0 to 1.0 scale; (a) Mondial, queries M1 to M6; (b) SIGMOD, queries S1 to S6; (c) EBAY, queries E1 to E6.]
- All have perfect recall
- XReason: close to perfect precision
- Worst is ELCA

Efficiency Experiments
- Compare PatternStack with a naïve algorithm for computing the patterns
[Figure: response times on a logarithmic axis (0.01 to 1,000.00) for queries M1 to M6 and S1 to S6.]
- PatternStack is significantly faster
- Response times for PatternStack are reasonable for real-time systems

Efficiency Experiments: Scalability
- With three and four keywords
- We truncated the inverted lists at different sizes: 20%, 40%, 60%, 80% and 100%
[Figure: response time t (msec) against number of results for k=3 and k=4; (a) NASA, up to 250,000 results and 550 msec; (b) XMark, up to 25,000 results and 1,800 msec.]

Conclusion & Future Work
- Homomorphisms can be effectively used to compare patterns
- Results show that the approach is effective and efficient
- Optimizations of the precedence graph for efficiency
- Additional information, such as statistical information, can be combined for ranking the results