4. Twig Join (CIKM'09, Hong Kong). To answer a twig query, the twig pattern is decomposed into several path patterns, and the path solutions are joined to compose the final result. The Holistic Twig Join (HTJ) algorithm is a specialized multi-way sort-merge join that guarantees I/O optimality for a certain subset of XML queries. The optimality depends on how the elements are partitioned. HTJ uses stacks and streams in which the elements are kept in sorted order. [Figure: a twig query over nodes A, E, B, C with stacks SA, SE, SB, SC and their streams.]
5. Motivation. There is a discrepancy between XML stored in an RDB and conventional HTJ algorithms. Logical: streams vs. tables. Physical: partitioned vs. record-oriented storage. Supporting actual data, including a large volume of text, requires references to records. How should tuples be fed to an HTJ algorithm? What is the best partitioning scheme for XML stored in an RDB? The bitmap index, a conventional index in RDBMSs, is an efficient way to indicate tuples and supports logical operations efficiently. Can we use the bitmap index to support HTJ?
6. HTJ on Different Partitioning Schemes. Tag-based partitioning: simple, and a skipping technique can be used to read only the useful elements; for a query node, only one stream is accessed. Tag+Level partitioning: better I/O optimality, suitable for deep XML; several streams may be accessed for a single query node. Path-based partitioning: better I/O optimality, suitable for shallow XML; a path with //-axes may require accessing many streams for a single query node.
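The trade-off between the three schemes can be illustrated with a minimal sketch. The element list, the `partition` helper, and `streams_for_tag` are hypothetical names invented for this illustration, not part of the presented system; the sketch only counts how many streams a query node //A would touch under each scheme.

```python
from collections import defaultdict

# Hypothetical elements in document order: (tag, level, root-to-node path).
ELEMENTS = [
    ("A", 1, "/A"),
    ("A", 2, "/A/A"),
    ("B", 3, "/A/A/B"),
    ("B", 2, "/A/B"),
    ("A", 3, "/A/B/A"),
    ("E", 2, "/A/E"),
    ("A", 3, "/A/E/A"),
]

def partition(elements, key):
    """Group element positions into streams keyed by key(element)."""
    streams = defaultdict(list)
    for pos, elem in enumerate(elements):
        streams[key(elem)].append(pos)
    return dict(streams)

by_tag = partition(ELEMENTS, lambda e: e[0])
by_tag_level = partition(ELEMENTS, lambda e: (e[0], e[1]))
by_path = partition(ELEMENTS, lambda e: e[2])

def streams_for_tag(streams, tag):
    """Keys of the streams that may hold elements matching query node //tag."""
    def tag_of(key):
        if isinstance(key, tuple):          # (tag, level) key
            return key[0]
        return key.rsplit("/", 1)[-1]       # plain tag or path key
    return sorted(k for k in streams if tag_of(k) == tag)

for name, streams in [("tag", by_tag), ("tag+level", by_tag_level), ("path", by_path)]:
    print(name, streams_for_tag(streams, "A"))
```

On this toy data the tag scheme touches one stream for //A, while the finer schemes touch several smaller ones.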
7. Bitmap Index. How are tuples in the NODE table partitioned? By building a bitmap index on certain column(s) of the table: bitTag on tagName, bitTag+ on (tagName, Level), and bitPath on the pathId column. The choice determines the I/O optimality of holistic twig join algorithms. During the twig join process, useful tuples are accessed via the bitmap index. [Figure: bit-vectors for tags A, B, E, ... pointing to disk blocks.]
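As a rough sketch of the idea, a bitmap index can be modeled as one bit-vector per distinct column value, with bit i set when row i holds that value. The NODE rows, the helper names, and the use of Python integers as bit-vectors are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical NODE table rows in storage order.
NODE = [
    {"tagName": "A", "level": 1, "pathId": 0},
    {"tagName": "A", "level": 2, "pathId": 1},
    {"tagName": "B", "level": 3, "pathId": 2},
    {"tagName": "E", "level": 3, "pathId": 3},
    {"tagName": "B", "level": 2, "pathId": 4},
]

def build_bitmap_index(rows, columns):
    """Map each distinct value of `columns` to an int used as a bit-vector."""
    index = defaultdict(int)
    for pos, row in enumerate(rows):
        key = tuple(row[c] for c in columns)
        index[key] |= 1 << pos              # set bit `pos` for this key
    return dict(index)

def set_positions(bits):
    """Row positions whose bit is 1."""
    return [pos for pos in range(bits.bit_length()) if (bits >> pos) & 1]

bit_tag = build_bitmap_index(NODE, ["tagName"])                 # bitTag
bit_tag_plus = build_bitmap_index(NODE, ["tagName", "level"])   # bitTag+
bit_path = build_bitmap_index(NODE, ["pathId"])                 # bitPath

print(set_positions(bit_tag[("B",)]))
```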
8. Additional Indexes. bitAnc: a bit-vector that represents the terminal elements corresponding to a certain path and all their ancestors. bitDesc: a bit-vector that represents the terminal elements corresponding to a certain path and all their descendants. [Figure: a subtree with elements a1 to a4, b1 to b3, c3, d3, and e2, covered by the three bit-vectors bitPath, bitAnc, and bitDesc for PathId=2, i.e. /A/A/B.]
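The two definitions can be made concrete with a small sketch. The tree below and the function names are hypothetical; bit-vectors are modeled as Python integers, with bit i standing for the element at document position i.

```python
# Hypothetical tree in document order: (tag, parent position or None).
NODES = [
    ("A", None),  # 0: /A
    ("A", 0),     # 1: /A/A
    ("B", 1),     # 2: /A/A/B
    ("C", 2),     # 3: /A/A/B/C
    ("E", 1),     # 4: /A/A/E
    ("B", 0),     # 5: /A/B
]

def ancestors_and_self(nodes, pos):
    """Positions from `pos` up to the root, inclusive."""
    chain = []
    while pos is not None:
        chain.append(pos)
        pos = nodes[pos][1]
    return chain

def path_of(nodes, pos):
    tags = [nodes[p][0] for p in reversed(ancestors_and_self(nodes, pos))]
    return "/" + "/".join(tags)

def build_path_vectors(nodes, path):
    """Return (bitPath, bitAnc, bitDesc) for one root-to-node path."""
    terminals = {p for p in range(len(nodes)) if path_of(nodes, p) == path}
    bit_path = bit_anc = bit_desc = 0
    for pos in range(len(nodes)):
        chain = ancestors_and_self(nodes, pos)
        if pos in terminals:
            bit_path |= 1 << pos
            for a in chain:                 # the terminal and all its ancestors
                bit_anc |= 1 << a
        if any(a in terminals for a in chain):
            bit_desc |= 1 << pos            # a terminal or one of its descendants
    return bit_path, bit_anc, bit_desc

bit_path, bit_anc, bit_desc = build_path_vectors(NODES, "/A/A/B")
print(bin(bit_path), bin(bit_anc), bin(bit_desc))
```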
9. Two Types of Indexes. Basic index: bit-vectors are built on a single column or a group of columns; requires labeled values and reading records. Hybrid index: a combination of two different indexes. descTag: bitDesc & bitTag. bitTwig: bitPath & bitAnc; does not require labeled values to compute twig solutions.
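One possible reading of the hybrid indexes is sketched below. It assumes that "&" means a bitwise AND of the two bit-vectors, which the slide does not confirm; the vectors and function names are hypothetical and correspond to a six-element document with positions 0: /A, 1: /A/A, 2: /A/A/B, 3: /A/A/B/C, 4: /A/A, 5: /A/A/E.

```python
# Hypothetical bit-vectors (bit i = element at document position i).
BIT_PATH = {"/A/A": 0b010010}        # elements 1 and 4
BIT_ANC = {"/A/A/B": 0b000111}       # element 2 and its ancestors 1, 0
BIT_DESC = {"/A/A/B": 0b001100}      # element 2 and its descendant 3
BIT_TAG = {"C": 0b001000}            # element 3

def desc_tag(path, tag):
    """Assumed descTag: elements tagged `tag` at or below a terminal of `path`."""
    return BIT_DESC[path] & BIT_TAG[tag]

def bit_twig(branch_path, leaf_path):
    """Assumed bitTwig: `branch_path` elements that lead to a `leaf_path` terminal."""
    return BIT_PATH[branch_path] & BIT_ANC[leaf_path]

print(bin(desc_tag("/A/A/B", "C")))       # C elements under /A/A/B
print(bin(bit_twig("/A/A", "/A/A/B")))    # /A/A elements that have a B child
```

Under this reading, the relationship between an /A/A element and its B child is decided from the bit-vectors alone, without consulting labels stored in the records.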
10.
11. Advancing Cursors. Choose the minimum position value among the current 1's as the current element for a query node. To check whether a 1 exists in the interval between pos(a) and pos(d), look ahead at the next 1. [Figure: cursors over bit-vectors P0: /A and P1: /A/A for query node q: //A, showing Currq, Current1, Next1, and the eov marker.]
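The two cursor rules can be sketched as follows. The bit-vectors, the names, and the choice of an open interval for the look-ahead check are assumptions for illustration; `None` plays the role of the end-of-vector marker.

```python
def next_one(bits, start):
    """Smallest position >= start whose bit is 1, or None at end of vector."""
    shifted = bits >> start
    if shifted == 0:
        return None
    return start + (shifted & -shifted).bit_length() - 1

# Hypothetical bit-vectors for the two paths that match query node q: //A.
VECTORS = {"P0:/A": 0b1, "P1:/A/A": (1 << 1) | (1 << 6)}

def current_element(cursors):
    """Pick the minimum position among the cursors' current 1's."""
    live = [(pos, name) for name, pos in cursors.items() if pos is not None]
    return min(live) if live else None

def has_one_between(bits, low, high):
    """Look ahead: is there a 1 strictly between positions low and high?"""
    nxt = next_one(bits, low + 1)
    return nxt is not None and nxt < high

cursors = {name: next_one(bits, 0) for name, bits in VECTORS.items()}
first = current_element(cursors)
# Advance the cursor that supplied the current element.
cursors[first[1]] = next_one(VECTORS[first[1]], first[0] + 1)
second = current_element(cursors)
print(first, second)
```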
12. Optimizations. Early detection through the absence of a bit-vector. Condensing query nodes: for path-based partitioning; reduces |INDEX| and |RECORD|. Skipping obsolete records with advance(k): for tag-based and (tag, level)-based partitioning; reduces |RECORD|. Moving cursors over compressed bit-vectors without decompression: a composite cursor moves over a bit-vector compressed with a run-length encoding scheme; reduces |INDEX|. [Figure: for P: //A/B/C, with CA = 11 on bit-vector 10000000000100000 and CB = 4 on bit-vector 00001000010000100, advance(11) moves CB forward.]
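The skipping step in the figure can be sketched with the two bit-vectors shown on the slide. The function below is an assumed reading of advance(k), moving a cursor to the first 1 at or after position k; positions are counted from 0 at the left end of the string.

```python
A_BITS = "10000000000100000"   # bit-vector for A, from the slide
B_BITS = "00001000010000100"   # bit-vector for B, from the slide

def advance(bits, k):
    """Position of the first 1 at or after k, or None at end of vector."""
    pos = bits.find("1", k)
    return None if pos == -1 else pos

c_a = advance(A_BITS, 1)        # cursor on A after its first element
c_b = advance(B_BITS, 0)        # cursor on B at its first element
new_c_b = advance(B_BITS, c_a)  # advance(11): jump B's cursor up to A's position

# Records of B whose positions lie between the old and new cursor are never read.
skipped = [p for p, bit in enumerate(B_BITS) if bit == "1" and c_b < p < new_c_b]
print(c_a, c_b, new_c_b, skipped)
```

In this example the B record at position 9 is skipped because it starts before the current A element and so cannot lie inside it.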
13. Compressed Bit-vector. (a) An original bit-vector with 8,000 bits: 31 bits, 256 × 31 bits, 31 bits, and 2 bits. (b) Grouping into units of 31 bits and merging identical groups. (c) Encoding each group as one word (4 bytes on a 32-bit machine): an uncompressed word holding 31 literal bits, a compressed word whose run-length is 256, another uncompressed word, and a remaining word.
Cursor C = {
  C.position, // Integer position value (logical address)
  C.word,     // The current word C is located at
  C.bit,      // The position of the bit C is visiting, in C.word
  C.rest      // The bit position in the remaining word
}
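Steps (a) to (c) can be approximated with the following sketch. For readability it represents each word as a Python tuple rather than a packed 32-bit integer, and the exact contents of the first 31-bit group are an assumption (the slide's bit string is only partly legible); the names are hypothetical.

```python
GROUP = 31  # bits per group

def compress(bits):
    """Encode a '0'/'1' string as literal, fill, and remaining words."""
    words = []
    full = len(bits) - len(bits) % GROUP
    for i in range(0, full, GROUP):
        group = bits[i:i + GROUP]
        if group in ("0" * GROUP, "1" * GROUP):
            # Merge identical all-0 or all-1 groups into one fill word.
            if words and words[-1][0] == "fill" and words[-1][1] == group[0]:
                words[-1] = ("fill", group[0], words[-1][2] + 1)
            else:
                words.append(("fill", group[0], 1))
        else:
            words.append(("lit", group))
    if full < len(bits):
        words.append(("rest", bits[full:]))   # fewer than 31 trailing bits
    return words

# Assumed 8,000-bit vector: 31 + 256*31 + 31 + 2 bits.
FIRST = "0" * 3 + "1" + "0" * 8 + "1" + "0" * 16 + "11"
THIRD = "0" * 30 + "1"
bits = FIRST + "0" * (256 * GROUP) + THIRD + "00"

words = compress(bits)
print(len(bits), len(words), words[1])
```

The 8,000 bits collapse into four words, the second of which stands for a run of 256 all-zero groups.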
14. Moving a Cursor over a Compressed Bit-vector. a) Get the position of the next 1: starting from C = {31, 0, 31, 0}, the cursor skips examining the 31 × 256 bits of the compressed word (run-length 256) and arrives at C = {7998, 2, 31, 0}. b) Check the bit value at position 3,000: from C = {31, 0, 31, 0}, the distance to move is 2,969 = (3000 - 31). Since 31 × 256 > 2,969, the bit we are looking for is within word 1.
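Both operations can be sketched over a compressed vector without expanding it. The word layout (tuples for literal, fill, and remaining words), the 1-based positions, and the contents of the first literal word are assumptions chosen to match the numbers on the slide; the function names are hypothetical.

```python
GROUP = 31

# Assumed compressed form of the 8,000-bit vector: literal, fill (256 runs), literal, rest.
WORDS = [
    ("lit", "0" * 3 + "1" + "0" * 8 + "1" + "0" * 16 + "11"),
    ("fill", "0", 256),
    ("lit", "0" * 30 + "1"),
    ("rest", "00"),
]

def word_len(word):
    """Number of original bits a word stands for."""
    return word[2] * GROUP if word[0] == "fill" else len(word[1])

def bit_at(words, position):
    """Bit value at a 1-based position, stepping over whole words."""
    offset = position - 1
    for word in words:
        n = word_len(word)
        if offset < n:
            return int(word[1]) if word[0] == "fill" else int(word[1][offset])
        offset -= n
    raise IndexError(position)

def next_one(words, position):
    """1-based position of the first 1 strictly after `position`, or None."""
    base = 0                                  # bits covered by earlier words
    for word in words:
        n = word_len(word)
        if base + n > position:               # this word has bits after `position`
            start = max(position - base, 0)   # 0-based offset of the first candidate
            if word[0] == "fill":
                if word[1] == "1":
                    return base + start + 1
                # an all-0 fill is skipped in one step
            else:
                i = word[1].find("1", start)
                if i != -1:
                    return base + i + 1
        base += n
    return None

print(next_one(WORDS, 31), bit_at(WORDS, 3000))
```

The fill word is crossed by arithmetic on its run-length alone, which is what lets the cursor move without decompression.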
15. Experiments. Datasets: synthetic (XMark) and real (DBLP, Treebank, Swiss-prot). Query sets.
21. Other Features on bitPath. The bit-vectors used for a path pattern with //-axes can be merged and the result put into the bitmap index for the next time; for a given path //A//B, the bit-vectors of P:/A/A/B and P:/A/B are merged. The merged bit-vector acts like a pre-computed join index: a path pattern with //-axes can be represented by a single bit-vector. Logical operations (OR, NOT) are simply supported by bitwise logical operations: &, |, ^.
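A minimal sketch of the merging idea follows. The path-to-vector table, the cache dictionary, and the regular-expression matcher for patterns built from / and // steps are assumptions for illustration, not the presented implementation.

```python
import re

# Hypothetical bitPath index: root-to-node path -> bit-vector (as an int).
BIT_PATH = {
    "/A": 0b0000001,
    "/A/A": 0b0000010,
    "/A/A/B": 0b0000100,
    "/A/A/B/C": 0b0001000,
    "/A/A/E": 0b0010000,
    "/A/B": 0b0100000,
}

MERGED_CACHE = {}  # pattern -> merged bit-vector, reused the next time

def pattern_to_regex(pattern):
    """Translate a pattern made of /name and //name steps into a regex."""
    out = ""
    for axis, name in re.findall(r"(//|/)([^/]+)", pattern):
        if axis == "//":
            out += "(?:/[^/]+)*"          # any number of intermediate steps
        out += "/" + re.escape(name)
    return re.compile("^" + out + "$")

def merged_vector(pattern):
    """OR together the bit-vectors of every stored path matching `pattern`."""
    if pattern not in MERGED_CACHE:
        regex = pattern_to_regex(pattern)
        merged = 0
        for path, bits in BIT_PATH.items():
            if regex.match(path):
                merged |= bits
        MERGED_CACHE[pattern] = merged
    return MERGED_CACHE[pattern]

print(bin(merged_vector("//A//B")))
```

After the first call, //A//B is answered from a single cached bit-vector instead of being recombined from /A/A/B and /A/B.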
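22. Twig Queries with Logical Operations. OR: the query //A[./B/C or ./B/D]//E is answered with P//A, P//A//B//X ≡ P//A//B//C ∨ P//A//B//D, and P//A//E, where X stands for (C|D). NOT: the query //A[./B/not(C)]//E is answered with P//A, P//A//E, and P//A/B ⓧ (P//A/B ⊙ A//A/B/C). [Figures: the twig patterns for the two queries, with X = (C|D) in the first and ¬C in the second.]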
23. Conclusions. We investigated the possibilities of bitmap indexes for XML query processing: XML stored in an RDB can be partitioned in various ways, and cursor movements do not require decompression of bit-vectors. We devised a way to identify element relationships with only a bitmap index, bitTwig. Our experiments showed that bitTwig was best for queries against shallow XML documents; for deep XML documents, bitTag with advance(k) showed the best performance. Future work: evaluating our system with more HTJ algorithms and other indexes.