In this paper, we propose new models and algorithms to perform practical computations on W3C XML Schemas: schema minimization, schema equivalence testing, subschema testing, and subschema extraction. We have conducted experiments on an e-commerce standard XSD called xCBL to demonstrate the effectiveness of our algorithms. One experiment refuted the claim that the xCBL 3.5 XSD is compatible with the xCBL 3.0 XSD. Another experiment showed that the xCBL XSDs can be effectively trimmed into small subschemas for specific applications, which significantly reduces schema processing time.
XML Schema Computations: Schema Compatibility Testing and Subschema Extraction
1. XML Schema Computations: Schema
Compatibility Testing and Subschema Extraction
Thomas Y.T. LEE and David W.L. Cheung
Department of Computer Science
The University of Hong Kong
October 28, 2010
CIKM 2010
Toronto, Canada
2. Outline
Introduction and motivation
Formal models for XML data and schemas
Schema computational algorithms
Experiments and conclusions
4. Data interoperability on web services
In order for two web services to be interoperable, the XML
schema on the message receiving end must accept all possible
XML messages from the sending end.
The sending schema must be a subschema of the receiving
schema.
[Figure: Web Service A (Schema A) sends XML instances to Web Service B (Schema B); the instances of Schema A are a subset (⊆) of the instances of Schema B.]
5. W3C XML Schema and data standards
1. W3C XML Schema (XSD) is the most popular schema
language to define data standards.
2. In order for the new version of an XSD to be
backward-compatible with the old version, the new version
must be a superschema of the old version.
The new schema must accept every instance of the old
schema.
3. However, a typical e-commerce standard XSD contains
thousands of types / elements, which makes manual
verification of compatibility hardly possible.
4. When an XSD is too large, how can we extract a smaller
subschema just enough for processing by a specific
application?
6. Schema compatibility problems
1. Given two XSDs, how can we verify that they are equivalent or
that one is a subschema of the other?
2. Given XSD A, how can we extract a smaller subschema of A called
B so that B recognizes only a subset of the elements recognized
by A?
3. In this research, we have developed the formal models for
XML data and schemas, as well as the algorithms to solve
these problems.
8. Data Tree (DT) to model XML data
A DT is a tree where edges represent elements and nodes
represent their contents.
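XML document:
  <Quote>
    <Line>
      <Desc>hPhone</Desc>
      <Price>499.9</Price>
    </Line>
    <Line>
      <Desc>iMat</Desc>
      <Price>999.9</Price>
    </Line>
  </Quote>
Corresponding DT (each edge is labeled with an element name):
  n0:ε --<Quote>--> n1:ε
  n1:ε --<Line>--> n2:ε
  n1:ε --<Line>--> n3:ε
  n2:ε --<Desc>--> n4:"hPhone"
  n2:ε --<Price>--> n5:"499.9"
  n3:ε --<Desc>--> n6:"iMat"
  n3:ε --<Price>--> n7:"999.9"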
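The DT construction can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class name `DTNode`, the function `to_data_tree`, and the choice to store ε as the empty string are assumptions made for this example. Attributes and mixed content are ignored.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class DTNode:
    # "" stands for the empty value ε carried by non-leaf nodes.
    value: str = ""
    # Ordered (element name, child node) pairs: the labeled edges of the DT.
    edges: list = field(default_factory=list)


def to_data_tree(xml_text: str) -> DTNode:
    """Build a DT whose root n0 has one edge labeled with the document element."""
    def build(elem):
        node = DTNode()
        children = list(elem)
        if children:
            node.edges = [(child.tag, build(child)) for child in children]
        else:
            node.value = (elem.text or "").strip()
        return node

    root_elem = ET.fromstring(xml_text)
    return DTNode(edges=[(root_elem.tag, build(root_elem))])


doc = (
    "<Quote>"
    "<Line><Desc>hPhone</Desc><Price>499.9</Price></Line>"
    "<Line><Desc>iMat</Desc><Price>999.9</Price></Line>"
    "</Quote>"
)
dt = to_data_tree(doc)
n1 = dt.edges[0][1]
print([label for label, _ in n1.edges])  # edge labels under n1
```

Here `dt` plays the role of n0 and `n1` the node reached through the <Quote> edge.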
9. Schema Automaton (SA) to model XML schemas
1. An SA is a deterministic finite automaton (DFA) where each
state is associated with a regular expression (RE) and a set of
values called the value domain (VDom).
2. The DFA, called the vertical language (VLang), defines how the
symbols are arranged along the paths from the root to the
leaves.
2.1 Each state represents an XSD data type and each symbol
represents an element name.
3. The RE of a state, called the horizontal language (HLang),
defines how child elements can be arranged under an XSD
data type, i.e., the content model.
4. The value domain defines the set of all possible values an
element can contain.
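To make the three ingredients (VLang, HLang, VDom) concrete, here is a small Python sketch of an SA and a recursive acceptance check for a DT. Everything in it is an assumption for illustration: the dictionary layout, the encoding of a DT node as `(value, [(element_name, child), ...])`, the HLang written as a Python regex over space-terminated element names, and the VDom given as a predicate.

```python
import re


def is_empty(v):
    return v == ""


def is_string(v):
    return True


def is_decimal(v):
    return re.fullmatch(r"[+-]?\d+(\.\d+)?", v) is not None


# Hypothetical SA for quotes: q0 is the start state (data type of the root).
SA = {
    "start": "q0",
    # VLang: DFA transitions (state, element name) -> state.
    "delta": {("q0", "Quote"): "q1", ("q1", "Line"): "q2",
              ("q2", "Desc"): "q3", ("q2", "Price"): "q4"},
    # HLang: content model of each state, each name followed by one space.
    "hlang": {"q0": "Quote ", "q1": "(Line )+", "q2": "Desc Price ",
              "q3": "", "q4": ""},
    # VDom: which values a node in this state may carry.
    "vdom": {"q0": is_empty, "q1": is_empty, "q2": is_empty,
             "q3": is_string, "q4": is_decimal},
}


def accepts(sa, node, q=None):
    """Check a DT node against state q: value, child sequence, then children."""
    q = sa["start"] if q is None else q
    value, edges = node
    if not sa["vdom"][q](value):
        return False
    word = "".join(name + " " for name, _ in edges)
    if re.fullmatch(sa["hlang"][q], word) is None:
        return False
    for name, child in edges:
        nxt = sa["delta"].get((q, name))
        if nxt is None or not accepts(sa, child, nxt):
            return False
    return True


def line(desc, price):
    return ("Line", ("", [("Desc", (desc, [])), ("Price", (price, []))]))


good = ("", [("Quote", ("", [line("hPhone", "499.9"), line("iMat", "999.9")]))])
bad_price = ("", [("Quote", ("", [line("iMat", "cheap")]))])
no_lines = ("", [("Quote", ("", []))])
print(accepts(SA, good), accepts(SA, bad_price), accepts(SA, no_lines))
```

The three checks in `accepts` correspond one-to-one to VDom, HLang, and VLang.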
12. Schema compatibility testing
1. Schema equivalence testing and subschema testing.
2. Both tests involve schema minimization.
2.1 All useless states (data types) are removed first. A useless
state is an inaccessible state or a state which does not
recognize any element with a finite number of descendants.
2.2 The process is like a DFA minimization but the HLang and
VDom of each state are considered when deciding whether
two states can be merged.
3. We have proved that two SAs (XSDs) are equivalent iff their
minimized forms have isomorphic VLang DFAs and all
corresponding HLangs and VDoms are equivalent.
4. We have developed an algorithm to verify whether an SA is a
subschema of another SA.
13. Useless states
[Figure: VLang DFA of an example SA with states q0 to q9 and transitions labeled A, B and C.]
q    HLang(q)    VDom(q)
q0   A{2,5}BC?   STRINGS
q1   C*          STRINGS
q2   { }         INTEGERS
q3   A*          STRINGS
q4   B+          STRINGS
q5   C           STRINGS
q6   A+B*        INTEGERS
q7   A?          STRINGS
q8   B*          STRINGS
q9   { }         DECIMALS
1. q7 and q8 are inaccessible.
2. q5 and q6 are irrational because any element they recognize would need infinitely many descendants.
3. q9 is useless because it is blocked by irrational states.
4. q4 is useless because it must lead to an irrational state.
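The four observations above can be reproduced with a two-phase fixpoint computation, sketched here in Python. Two assumptions are made loudly: the transition table `DELTA` is a reconstruction that is merely consistent with the table and the observations (the original diagram is not recoverable from this transcript), and each HLang is summarized by the set of symbols that occur in every one of its words (`REQUIRED`), which is only adequate for content models without alternation.

```python
# Assumed transitions (state, symbol) -> state; not taken from the original figure.
DELTA = {
    ("q0", "A"): "q1", ("q0", "B"): "q2", ("q0", "C"): "q3",
    ("q1", "C"): "q4", ("q3", "A"): "q2", ("q4", "B"): "q5",
    ("q5", "C"): "q6", ("q6", "A"): "q5", ("q6", "B"): "q9",
    ("q7", "A"): "q8", ("q8", "B"): "q3",
}
# Symbols that every word of HLang(q) must contain, e.g. A{2,5}BC? needs A and B.
REQUIRED = {
    "q0": {"A", "B"}, "q1": set(), "q2": set(), "q3": set(), "q4": {"B"},
    "q5": {"C"}, "q6": {"A"}, "q7": set(), "q8": set(), "q9": set(),
}


def useful_states(start, delta, required):
    # Phase 1: least fixpoint of "rational" states, i.e. states that can
    # recognize some element with finitely many descendants.
    rational = set()
    changed = True
    while changed:
        changed = False
        for q in set(required) - rational:
            if all(delta.get((q, s)) in rational for s in required[q]):
                rational.add(q)
                changed = True
    if start not in rational:
        return set()
    # Phase 2: keep only states reachable from the start via rational states.
    reachable, stack = {start}, [start]
    while stack:
        q = stack.pop()
        for (p, _symbol), r in delta.items():
            if p == q and r in rational and r not in reachable:
                reachable.add(r)
                stack.append(r)
    return reachable


useful = useful_states("q0", DELTA, REQUIRED)
useless = set(REQUIRED) - useful
print(sorted(useful), sorted(useless))
```

Under these assumptions q4, q5 and q6 fail phase 1, while q7, q8 and q9 fail phase 2. A full implementation would also restrict each remaining HLang to the surviving symbols (here, HLang(q1) = C* would shrink to the empty word).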
14. Schema minimization and equivalence
Schema A
[Figure: VLang DFA of Schema A with states q0 to q8 and transitions labeled <Quote>, <Order>, <Line>, <Product>, <Qty>, <Desc> and <Price>.]
q    HLang(q)        VDom(q)
q0   Quote | Order   { }
q1   Line +          { }
q2   Line +          { }
q3   Desc Price      { }
q4   Product Qty     { }
q5   { }             STRS
q6   { }             DECS
q7   Desc Price      { }
q8   { }             INTS
1. q3 and q7 can be merged into q9.
2. The two SAs (Schema A and Schema B) are equivalent.
Schema B
[Figure: VLang DFA of Schema B with states q0, q1, q2, q4, q5, q6, q8 and q9 and the same transition labels.]
q    HLang(q)        VDom(q)
q0   Quote | Order   { }
q1   Line +          { }
q2   Line +          { }
q9   Desc Price      { }
q4   Product Qty     { }
q5   { }             STRS
q6   { }             DECS
q8   { }             INTS
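The merge of q3 and q7 can be reproduced with a DFA-style partition refinement in which the initial partition is keyed on (HLang, VDom). The Python sketch below is illustrative only. The transitions of Schema A are an assumed reading of the diagram, and string equality of the HLang expressions stands in for true language equivalence, which a real implementation would have to decide.

```python
STATES = ["q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8"]
# Assumed transitions of Schema A (reconstructed, not taken from the figure).
DELTA = {
    ("q0", "Quote"): "q1", ("q0", "Order"): "q2",
    ("q1", "Line"): "q3", ("q2", "Line"): "q4",
    ("q3", "Desc"): "q5", ("q3", "Price"): "q6",
    ("q4", "Product"): "q7", ("q4", "Qty"): "q8",
    ("q7", "Desc"): "q5", ("q7", "Price"): "q6",
}
HLANG = {"q0": "Quote | Order", "q1": "Line +", "q2": "Line +",
         "q3": "Desc Price", "q4": "Product Qty", "q5": "", "q6": "",
         "q7": "Desc Price", "q8": ""}
VDOM = {"q0": "{}", "q1": "{}", "q2": "{}", "q3": "{}", "q4": "{}",
        "q5": "STRS", "q6": "DECS", "q7": "{}", "q8": "INTS"}


def renumber(signature):
    """Map each distinct signature to a small integer block id."""
    ids = {}
    return {q: ids.setdefault(sig, len(ids)) for q, sig in signature.items()}


def merge_classes(states, delta, hlang, vdom):
    symbols = sorted({s for (_q, s) in delta})
    # Initial partition: states with the same HLang and VDom share a block.
    block = renumber({q: (hlang[q], vdom[q]) for q in states})
    while True:
        # Split blocks whose members disagree on the block of some successor.
        refined = renumber({
            q: (block[q],) + tuple(block.get(delta.get((q, s))) for s in symbols)
            for q in states
        })
        if len(set(refined.values())) == len(set(block.values())):
            break
        block = refined
    classes = {}
    for q in states:
        classes.setdefault(block[q], []).append(q)
    return sorted(sorted(c) for c in classes.values())


classes = merge_classes(STATES, DELTA, HLANG, VDOM)
print(classes)
```

Note that q1 and q2 start in the same block (both have HLang Line + and VDom { }) but are split in the first round, because their <Line> successors q3 and q4 have different content models.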
15. Subschema testing
Schema A
[Figure: VLang DFA of Schema A with states q0, q1, q2, q4, q5, q6, q8 and q9 and transitions labeled <Quote>, <Order>, <Line>, <Product>, <Qty>, <Desc> and <Price>.]
q    HLang(q)        VDom(q)
q0   Quote | Order   { }
q1   Line +          { }
q2   Line +          { }
q9   Desc Price      { }
q4   Product Qty     { }
q5   { }             STRS
q6   { }             DECS
q8   { }             INTS
Schema B
[Figure: VLang DFA of Schema B with states q0, q1, q5, q6 and q9 and transitions labeled <Quote>, <Line>, <Desc> and <Price>.]
q    HLang(q)        VDom(q)
q0   Quote           { }
q1   Line +          { }
q9   Desc Price      { }
q5   { }             STRS
q6   { }             INTS
B is a subschema of A.
1. HLang(q0^B) ⊆ HLang(q0^A) and VDom(q0^B) = VDom(q0^A).
2. HLang(q6^B) = HLang(q6^A) and VDom(q6^B) ⊆ VDom(q6^A).
3. HLang(qi^B) = HLang(qi^A) and VDom(qi^B) = VDom(qi^A), for i = 1, 5, 9.
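The state-by-state comparison above suggests a simultaneous traversal of both automata, sketched below in Python. This is not the authors' algorithm. The transitions are an assumed reading of the diagrams, the VDom inclusions (INTS ⊆ DECS ⊆ STRS) are hard-coded, and HLang inclusion is only approximated by testing all words up to `max_len` symbols, so the function can report a false "yes" for content models that first differ on longer words.

```python
import re
from itertools import product

# Hand-written VDom inclusions: each key is contained in every listed value.
VDOM_SUPERSETS = {"{}": {"{}"}, "INTS": {"INTS", "DECS", "STRS"},
                  "DECS": {"DECS", "STRS"}, "STRS": {"STRS"}}

A = {"start": "q0",
     "delta": {("q0", "Quote"): "q1", ("q0", "Order"): "q2",
               ("q1", "Line"): "q9", ("q2", "Line"): "q4",
               ("q9", "Desc"): "q5", ("q9", "Price"): "q6",
               ("q4", "Product"): "q9", ("q4", "Qty"): "q8"},
     # HLang as a regex over element names, each followed by one space.
     "hlang": {"q0": "Quote |Order ", "q1": "(Line )+", "q2": "(Line )+",
               "q9": "Desc Price ", "q4": "Product Qty ",
               "q5": "", "q6": "", "q8": ""},
     "vdom": {"q0": "{}", "q1": "{}", "q2": "{}", "q9": "{}", "q4": "{}",
              "q5": "STRS", "q6": "DECS", "q8": "INTS"}}
B = {"start": "q0",
     "delta": {("q0", "Quote"): "q1", ("q1", "Line"): "q9",
               ("q9", "Desc"): "q5", ("q9", "Price"): "q6"},
     "hlang": {"q0": "Quote ", "q1": "(Line )+", "q9": "Desc Price ",
               "q5": "", "q6": ""},
     "vdom": {"q0": "{}", "q1": "{}", "q9": "{}", "q5": "STRS", "q6": "INTS"}}


def hlang_included(re_b, re_a, alphabet, max_len):
    if re_b == re_a:  # cheap syntactic shortcut before enumerating words
        return True
    for n in range(max_len + 1):
        for word in product(alphabet, repeat=n):
            text = "".join(s + " " for s in word)
            if re.fullmatch(re_b, text) and not re.fullmatch(re_a, text):
                return False
    return True


def is_subschema(b, a, max_len=3):
    """Bounded check that every instance of b is also an instance of a."""
    alphabet = sorted({s for (_q, s) in b["delta"]})
    seen, stack = set(), [(b["start"], a["start"])]
    while stack:
        qb, qa = stack.pop()
        if (qb, qa) in seen:
            continue
        seen.add((qb, qa))
        if a["vdom"][qa] not in VDOM_SUPERSETS[b["vdom"][qb]]:
            return False
        if not hlang_included(b["hlang"][qb], a["hlang"][qa], alphabet, max_len):
            return False
        for (p, s), rb in b["delta"].items():
            if p == qb:
                ra = a["delta"].get((qa, s))
                if ra is None:
                    return False
                stack.append((rb, ra))
    return True


print(is_subschema(B, A), is_subschema(A, B))
```

The sketch assumes both inputs are already free of useless states, so every transition of B is actually used by some instance.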
16. Subschema extraction
We have developed the subschema extraction algorithm:
Given SA (XSD) A and a set of symbols (element names) Z,
compute an SA which accepts all instances (XML documents)
of A except those containing some symbols not in Z.
[Figure: VLang DFA of SA A with states q0 to q7 and transitions labeled <Quote>, <Order>, <Line>, <Product>, <Qty>, <Desc> and <Price>.]
q    HLang(q)           VDom(q)
q0   <Quote>|<Order>    { }
q1   <Line>+            { }
q7   <Line>+            { }
q2   <Desc><Price>      { }
q3   <Product><Qty>     { }
q4   { }                STRINGS
q5   { }                DECIMALS
q6   { }                INTEGERS
Z = {<Quote>, <Line>, <Desc>, <Price>, <Order>, <Qty>}, where <Product> is
excluded.
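One way to realize the extraction on this example is to restrict every HLang to the allowed symbols, mark states whose restricted HLang becomes empty as dead, repeat until nothing changes, and keep what is still reachable. The Python sketch below follows that idea. It is an illustration under assumptions, not the published algorithm: the tuple-based regex AST, the function names, and the transition table (an assumed reading of the diagram) are all made up for this example.

```python
EMPTY, EPS = ("empty",), ("eps",)


def sym(name):
    return ("sym", name)


def restrict(r, allowed):
    """Rewrite regex r so that it only generates words over `allowed`."""
    kind = r[0]
    if kind == "sym":
        return r if r[1] in allowed else EMPTY
    if kind in ("eps", "empty"):
        return r
    if kind == "plus":
        inner = restrict(r[1], allowed)
        return EMPTY if inner == EMPTY else ("plus", inner)
    left, right = restrict(r[1], allowed), restrict(r[2], allowed)
    if kind == "seq":
        return EMPTY if EMPTY in (left, right) else ("seq", left, right)
    # kind == "alt": drop whichever branch became empty
    if left == EMPTY:
        return right
    if right == EMPTY:
        return left
    return ("alt", left, right)


def symbols(r):
    if r[0] == "sym":
        return {r[1]}
    return set().union(*(symbols(x) for x in r[1:]))


def extract(start, delta, hlang, z):
    dead = set()
    while True:
        restricted = {}
        for q, r in hlang.items():
            # A symbol stays usable if it is in Z and its target is not dead.
            usable = {s for s in z
                      if (q, s) in delta and delta[(q, s)] not in dead}
            restricted[q] = restrict(r, usable)
        now_dead = {q for q, r in restricted.items() if r == EMPTY}
        if now_dead == dead:
            break
        dead = now_dead
    if start in dead:
        return {}
    keep, stack = {start}, [start]
    while stack:
        q = stack.pop()
        for s in symbols(restricted[q]):
            nxt = delta[(q, s)]
            if nxt not in keep:
                keep.add(nxt)
                stack.append(nxt)
    return {q: restricted[q] for q in keep}


# Assumed transitions for the SA on this slide.
DELTA = {("q0", "Quote"): "q1", ("q0", "Order"): "q7", ("q1", "Line"): "q2",
         ("q7", "Line"): "q3", ("q2", "Desc"): "q4", ("q2", "Price"): "q5",
         ("q3", "Product"): "q2", ("q3", "Qty"): "q6"}
HLANG = {"q0": ("alt", sym("Quote"), sym("Order")),
         "q1": ("plus", sym("Line")), "q7": ("plus", sym("Line")),
         "q2": ("seq", sym("Desc"), sym("Price")),
         "q3": ("seq", sym("Product"), sym("Qty")),
         "q4": EPS, "q5": EPS, "q6": EPS}
Z = {"Quote", "Line", "Desc", "Price", "Order", "Qty"}

sub = extract("q0", DELTA, HLANG, Z)
print(sorted(sub), sub["q0"])
```

Excluding <Product> empties HLang(q3), which in turn kills q7 and removes the <Order> branch from q0, even though <Order> and <Qty> are themselves in Z.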
18. xCBL compatibility testing experiment
1. Data sets: XML Common Business Library
XSD        file size   no. of files   data types   element names   doc. types
xCBL 3.0   1.8MB       413            1,290        3,728           42
xCBL 3.5   2.0MB       496            1,476        4,473           51
2. The subschema testing program has disproved the claim on
xCBL.org:
The only modifications allowed to xCBL 3.0 documents were the
additions of new optional elements and additions to code lists; to
maintain interoperability between the two versions. An xCBL 3.0
instance of a document is also a valid instance in xCBL 3.5.
3. xCBL 3.5 is not a superschema of xCBL 3.0.
4. The experiment took only 272ms when the quick RE test
was applied.
Machine: Q6600@2.40GHz, 4GB RAM, Linux OS
19. Schema size reduction by subschema extraction
1. The subschema extraction program was run to extract
different subschemas from xCBL. Each subschema
recognizes a different element subset for a specific
application, e.g., order, invoice, etc.
2. The schema size was reduced to 6–32% of the original size.
3. The time required by XMLBeans to compile a subschema was
reduced to 34–50% of the time originally required.
4. The time to extract such a subschema was only 2–3s.
[Figure: Subschema extraction from xCBL 3.5. For the original schema and the invoice, order, quote, auction and catalog subschemas, the chart plots the number of element names, types and element declarations (left axis, 0 to 5000) and the XMLBeans compilation time in seconds (right axis, 0 to 35).]
20. Conclusions
1. We have developed:
formal models for XML and XSD, and
algorithms for schema equivalence and subschema testing,
and subschema extraction.
2. These problems are PSPACE-complete because they require
comparisons of regular expressions.
We have developed a heuristic (quick RE test) to make these
algorithms run fast on very large schemas.
3. Our experiments:
have proved that xCBL 3.5 is in fact not backward-compatible
with xCBL 3.0, and
have extracted small subschemas from xCBL for different
instance subsets, which largely reduce processing time on
these subschemas.
4. These models can be extended for other applications:
web service adaptor for legacy systems (text to XML
transformation), and
schema inferrer from XML instances.