Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding unlocks this content for applications ranging from crawlers to meta-search engines and is essential for improving usability and accessibility of the web. Form understanding has received surprisingly little attention other than as component in specific applications such as crawlers. No comprehensive approach to form understanding exists and previous works disagree even in the definition of the problem. In this paper, we present OPAL, the first comprehensive approach to form understanding. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems OPAL pushes the state of the art: For form labeling, it combines signals from the text, structure, and visual rendering of a web page, yielding robust characterisations of common design patterns. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern web forms OPAL outperforms previous approaches by a significant margin. For form interpretation, we introduce a template language to describe frequent form patterns. These two parts of OPAL combined yield form understanding with near perfect accuracy (> 98%).
08448380779 Call Girls In Friends Colony Women Seeking Men
OPAL: automated form understanding for the deep web - WWW 2012
1. DIADEM domain-centric intelligent automated
data extraction methodology
OPAL:
Automated Form Understanding
for the Deep Web
Xiaonan Guo
DIADEM Group, Department of Computer Science, Oxford University
joint work with Tim Furche, Giovanni Grasso, Jochen Kranzdorf, Giorgio
Orsi, Christian Schallhart
2. OPAL ›❯ Scenario
1
A scenario...
Looking for a house ?
Too many websites to check ?
Tired of searching through dozen of websites?
2
13. OPAL ›❯ Scenario
1
Automatic Form Understanding
Form Understanding := Form Labeling + Form
Form Labeling Form Interpretation
4
14. OPAL ›❯ Scenario
1
Automatic Form Understanding
Form Understanding := Form Labeling + Form
Form Labeling Form Interpretation
Field Scope
4
15. OPAL ›❯ Scenario
1
Automatic Form Understanding
Form Understanding := Form Labeling + Form
Form Labeling Form Interpretation
Segment Scope
Field Scope
4
16. OPAL ›❯ Scenario
1
Automatic Form Understanding
Form Understanding := Form Labeling + Form
Form Labeling Form Interpretation
Layout Scope
Segment Scope
Field Scope
4
17. OPAL ›❯ Scenario
1
Automatic Form Understanding
Form Understanding := Form Labeling + Form
Form Labeling Form Interpretation
Layout Scope Domain Scope
Segment Scope
Field Scope
4
18. OPAL ›❯ Scenario
1
Previous Approaches
Form Labeling
Layout Scope
Segment Scope
Field Scope
Only subset of scopes for form labeling
Only form labeling, no or little form interpretation
Where domain-specific, expensive to switch domain 5
19. OPAL ›❯ Scenario
1
Previous Approaches
Form Labeling
Dragut et al., VLDB’09
Layout Scope
Segment Scope
Khare et al, CIKM’09
Field Scope
Nguyen et al, VLDB’08
Only subset of scopes for form labeling
Only form labeling, no or little form interpretation
Where domain-specific, expensive to switch domain 5
20. OPAL ›❯ Scenario
1
OPAL
OPAL = multi-scope form labeling + flexible form
interpretation
1. form labeling with all three scopes
more accurate, more robust, less complex
2. form interpretation = matching and repair for elements to
concepts
3. template language and pattern library for defining form
ontologies
more succinct, more reuse, less work
6
21. OPAL ›❯ Scenario
1
OPAL
OPAL = multi-scope form labeling + flexible form
interpretation
1. form labeling with all three scopes
more accurate, more robust, less complex
2. form interpretation = matching and repair for elements to
concepts
3. template language and pattern library for defining form
1
ontologies
0.985
more succinct, more reuse, less work
>95%
0.97
0.955
0.94 6
UK Real Estate (100) UK Used Car (100) ICQ (98) Tel-8 (436)
22. OPAL ›❯ Form Labeling
2
Form Labeling
Form Labeling Form Interpretation
Domain Scope
Layout Scope
Segment Scope
Field Scope
7
23. OPAL ›❯ Form Labeling
2
Form Labeling
Form Labeling
Layout Scope
Segment Scope
Field Scope
7
24. OPAL ›❯ Form Labeling
2
Form Labeling
Form Labeling
Field Scope
7
25. OPAL ›❯ Form Labeling
2
Field Scope
simple heuristic to individual fields
assigns text nodes t to field f if
the least common ancestor of t and f is ancestor of no other
field
linear coloring algorithm
8
26. OPAL ›❯ Form Labeling
2
Field Scope
simple heuristic to individual fields
assigns text nodes t to field f if
the least common ancestor of t and f is ancestor of no other
field
linear coloring algorithm
8
27. OPAL ›❯ Form Labeling
2
Field Scope
simple heuristic to individual fields
assigns text nodes t to field f if
the least common ancestor of t and f is ancestor of no other
field
linear coloring algorithm
8
28. OPAL ›❯ Form Labeling
2
Field Scope
simple heuristic to individual fields
assigns text nodes t to field f if
the least common ancestor of t and f is ancestor of no other
field
linear coloring algorithm
field segment layout domain
Real-estate
Used-car
8
0.6 0.7 0.8 0.9 1
29. OPAL ›❯ Form Labeling
2
Segment Scope
Form Labeling
Field Scope
9
30. OPAL ›❯ Form Labeling
2
Segment Scope
Form Labeling
Segment Scope
Field Scope
9
31. OPAL ›❯ Form Labeling
2
Segment Scope
builds a segment tree
groups semantically likely related form elements, e.g. fields,
labels
assigns labels to fields and groups
form fields are grouped if
they occur in sequence
they have similar style
their least common ancestor contains
no other elements
10
32. OPAL ›❯ Form Labeling
2
Segment Scope
builds a segment tree
groups semantically likely related form elements, e.g. fields,
labels
assigns labels to fields and groups
form fields are grouped if
they occur in sequence
they have similar style
their least common ancestor contains
no other elements
10
33. OPAL ›❯ Form Labeling
2
Segment Tree
1 2 3 4 5
DOM Tree Segment Tree
Figure 3: Example DOM and Segment Tree
remove layout or template artifacts
artificially breaking sequences of style-equivalent fields
Algorithm 3: SegmentScopeLabeling(DOM P, Form Labeling F)
1 S SegmentTree(P) ; 11
35. OPAL ›❯ Form Labeling
2
Field & Segment Scope: Example
13
36. OPAL ›❯ Form Labeling
2
Layout Scope
Form Labeling
Segment Scope
Field Scope
14
37. OPAL ›❯ Form Labeling
2
Layout Scope
Form Labeling
Layout Scope
Segment Scope
Field Scope
14
38. OPAL ›❯ Form Labeling
2
Layout Scope
Aligns visually related texts and fields
prefers texts in the w-nw-n direction
considers overshadowing
T4
T2
F2
T1 NW N NE
T3 F3 W F1 E
SW S SE
15
42. OPAL ›❯ Form Labeling
3
Form Labeling
Form Labeling
Layout Scope
Segment Scope
Field Scope
17
43. OPAL ›❯ Form Labeling
2
Evaluation: ICQ & Tel8 Benchmark
Domain-independence Experiment—ICQ
Web query interfaces in 5 domains
100 query interfaces (20 in each domain)
Domain-independence Experiment—Tel8
Web query interfaces in 8 domains
477 in total
18
44. OPAL ›❯ Form Labeling
2
ICQ by Domain
1
0.98
0.96
0.94
0.92
0.9
Airfare Auto Book Job US R.E.
19
45. OPAL ›❯ Form Labeling
2
ICQ by Domain
1
0.98 >98%
0.96
0.94
0.92
0.9
Airfare Auto Book Job US R.E.
19
46. OPAL ›❯ Form Labeling
2
ICQ by Domain
1
0.98 >98%
0.96
0.94 Dragut et al.,
VLDB’09
0.92
0.9
Airfare Auto Book Job US R.E.
19
47. OPAL ›❯ Form Labeling
2
Tel-8 by Domain
0.99
0.96
0.93
0.9
Airfares Automobiles Books Car Rentals Hotels Jobs Movies Records
20
48. OPAL ›❯ Form Labeling
2
Tel-8 by Domain
>95%
0.99
0.96
0.93
0.9
Airfares Automobiles Books Car Rentals Hotels Jobs Movies Records
20
49. OPAL ›❯ Form Interpretation
3
Form Interpretation
Form Labeling
Layout Scope
Segment Scope
Field Scope
21
50. OPAL ›❯ Form Interpretation
3
Form Interpretation
Form Labeling Form Interpretation
Domain Scope
Layout Scope
Segment Scope
Field Scope
21
51. OPAL ›❯ Form Interpretation
3
Form Interpretation
22
52. OPAL ›❯ Form Interpretation
3
Form Interpretation
...
AreaBranch Element
AreaBranch Element
AreaBranch Element
AreaBranch Element
AreaBranch Element
...
Price Element(min)
Price Element(max)
...
22
53. OPAL ›❯ Form Interpretation
3
Form Interpretation
...
AreaBranch Element
AreaBranch Element
AreaBranch Element
AreaBranch Element
AreaBranch Element
...
Price Element(min)
Price Element(max)
...
22
54. OPAL ›❯ Form Interpretation
3
Form Interpretation
...
AreaBranch Element
AreaBranch Element
AreaBranch Element Area-Branch
Segment
AreaBranch Element
AreaBranch Element
...
Price Element(min)
Price Element(max)
...
22
55. OPAL ›❯ Form Interpretation
3
Form Interpretation
...
AreaBranch Element
AreaBranch Element
AreaBranch Element Area-Branch Geographic
Segment Segment
AreaBranch Element
AreaBranch Element
...
Price Element(min)
Price Element(max)
...
22
56. OPAL ›❯ Form Interpretation
3
Form Interpretation
...
AreaBranch Element
AreaBranch Element
AreaBranch Element Area-Branch Geographic
Segment Segment
AreaBranch Element
AreaBranch Element
...
Price Element(min)
Price Element(max)
...
22
57. OPAL ›❯ Form Interpretation
3
Form Interpretation
...
AreaBranch Element
AreaBranch Element
AreaBranch Element Area-Branch Geographic
Segment Segment
AreaBranch Element
AreaBranch Element
...
Price Element(min) Price
Price Element(max) Segment
...
22
58. OPAL ›❯ Form Interpretation
3
Form Interpretation
...
AreaBranch Element
AreaBranch Element
AreaBranch Element Area-Branch Geographic
Segment Segment
AreaBranch Element
AreaBranch Element
...
Price Element(min) Price
Price Element(max) Segment
...
22
59. OPAL ›❯ Form Interpretation
3
Form Interpretation
...
AreaBranch Element
AreaBranch Element
AreaBranch Element Area-Branch Geographic
Segment Segment
AreaBranch Element
Real-
AreaBranch Element
Estate
... Form
Price Element(min) Price
Price Element(max) Segment
...
22
60. OPAL ›❯ Form Interpretation
3
Form Interpretation
Form interpretation in OPAL =
annotation + classification + repair
Annotation: straightforward entity recognition
Classification: maps fields to concepts based on
annotations
Repair: enforces structural constraints
Together: high precision disambiguation for form
classification
23
61. OPAL ›❯ Form Interpretation
3
OPAL Ontology
What does on OPAL ontology consist of?
Annotation Annotation types “number”, “price period” (p.c.m.)
Classification Concepts “price field”, “location field”
Repair Constraints “R.E. form must have a location field”
constraints:
classification: between annotations and concepts
structural: between concepts
24
62. OPAL ›❯ Form Interpretation
3
OPAL Ontology
Reducing the effort in designing an OPAL ontology
by exploiting ubiquitous patterns of web forms
e.g., “range” over a numeric type, dependent types
Halevy et al., VLDB’08 identify 7 such patterns
but: these patterns must be instantiated for a specific
domain
e.g., “price range”, “radius” linked with “location”,
“postcode” with “city”, …
OPAL’s solution: templates for these patterns
25
63. OPAL ›❯ Form Interpretation
3
Form Patterns Example
Small set of ubiquitous patterns
ranges, dates, options, etc.
Ontology by instantiation
26
64. OPAL ›❯ Form Interpretation
3
Form Patterns Example
Small set of ubiquitous patterns
ranges, dates, options, etc.
Ontology by instantiation
26
65. OPAL ›❯ Form Interpretation
3
Form Patterns Example
Small set of ubiquitous patterns
ranges, dates, options, etc.
Ontology by instantiation
26
66. OPAL ›❯ Form Interpretation
3
Form Patterns Example
Small set of ubiquitous patterns
ranges, dates, options, etc.
Ontology by instantiation
26
67. OPAL ›❯ Form Interpretation
3
Form Patterns Example
Small set of ubiquitous patterns
ranges, dates, options, etc.
Ontology by instantiation
26
68. OPAL ›❯ Form Interpretation
3
Form Patterns Example
Small set of ubiquitous patterns
ranges, dates, options, etc.
Ontology by instantiation
26
69. OPAL ›❯ Form Interpretation
3
OPAL-TL
Patterns are specified using OPAL-TL
OPAL-TL = Datalog + Annotation Queries + Templates
27
70. OPAL ›❯ Form Interpretation
3
Annotation Queries
Annotation types come in two varieties
“label” types and “value” types for “price” vs. “€120”
Disambiguation through fixed precedence on annotation
types
e.g., field with value “lowest price first”
both “price” and “order-by” annotation, but “order-by” <
“price”
28
71. OPAL ›❯ Form Interpretation
3
Annotation Queries
1
2 A 4 A
A
B 3 A
C B
Figure 6: Example Form Labeling
N@A{ d, p, e }, “all nodes N having labels annotated with
A” are either provided by human domain experts or derived from e
d : direct (labels associated with the node itself) The current OPA
ternal sources such as DBPedia and Freebase.
version contains a large set of from field values) common doma
p : proper (from labels, but not
such artefacts for
types such as price, location, or date.
e : exclusive (labels considering precedence)
D EFINITION 11. Given a form labeling F on a DOM P and a
29
72. OPAL ›❯ Form Interpretation
3
Annotation Queries
1
2 A 4 A
A
B 3 A
C B
Figure 6: Example Form Labeling
are either provided by human domain experts or derived from e
ternal sources such as DBPedia and Freebase. The current OPA
version contains a large set of such artefacts for common doma
types such as price, location, or date.
D EFINITION 11. Given a form labeling F on a DOM P and a
30
73. OPAL ›❯ Form Interpretation
3
Annotation Queries
1
segment 2 A 4 A value label
A
B 3 A
field
C B proper label
Figure 6: Example Form Labeling
are either provided by human domain experts or derived from e
ternal sources such as DBPedia and Freebase. The current OPA
version contains a large set of such artefacts for common doma
types such as price, location, or date.
D EFINITION 11. Given a form labeling F on a DOM P and a
30
74. OPAL ›❯ Form Interpretation
3
Annotation Queries
1
2 A 4 A
A
B 3 A
C B
Figure 6: Example Form Labeling
assuming B < A
are either provided by human domain experts or derived from e
ternal sources such as DBPedia and Freebase. The current OPA
version contains a large set of such artefacts for common doma
types such as price, location, or date.
D EFINITION 11. Given a form labeling F on a DOM P and a
30
75. OPAL ›❯ Form Interpretation
3
Annotation Queries
1
2 A 4 A
A
B 3 A
C B
Figure 6: Example Form Labeling
assuming B < A
are either provided by human domain experts or derived from e
N@A{ } = 2, 3, 4
ternal sources such as DBPedia and Freebase. The current OPA
version contains a large set of such artefacts for common doma
types such as price, location, or date.
D EFINITION 11. Given a form labeling F on a DOM P and a
30
76. OPAL ›❯ Form Interpretation
3
Annotation Queries
1
2 A 4 A
A
B 3 A
C B
Figure 6: Example Form Labeling
assuming B < A
are either provided by human domain experts or derived from e
N@A{ } = 2, 3, 4
ternal sources such as DBPedia and Freebase. The current OPA
N@A{ d, e } contains a large set of such artefacts for common doma
version = 2, 4
types such as price, location, or date.
D EFINITION 11. Given a form labeling F on a DOM P and a
30
77. OPAL ›❯ Form Interpretation
3
Annotation Queries
1
2 A 4 A
A
B 3 A
C B
Figure 6: Example Form Labeling
assuming B < A
are either provided by human domain experts or derived from e
N@A{ } = 2, 3, 4
ternal sources such as DBPedia and Freebase. The current OPA
N@A{ d, e } contains a large set of such artefacts for common doma
version = 2, 4
N@A{ p, esuch as price, location, or date.
types } = 2, 3
D EFINITION 11. Given a form labeling F on a DOM P and a 30
78. OPAL ›❯ Form Interpretation
3
OPAL-TL Example
Price range
two successive fields in the same group
at least one “price” type
TEMPLATE basic_concept<C,A> { concept<C>(N)( N@A{d,e,p} }
2 range connector in between
TEMPLATE concept_by_segment<C,A> { concept<C>(N)( N@A{e,p} }
4
TEMPLATE concept_minmax<C,CM ,A> {
6 concept<CM >(N1 )( child(N1 ,G),child(N2 ,G),adjacent(N1 ,N2 ),
N1 @A{e,d},(concept<C>(N2 ) _ N2 @A{e,d})
8 concept<CM >(N2 )( child(N1 ,G),child(N2 ,G),follows(N2 ,N1 ),
concept<C>(N1 ),N2 @range_connector{e,d},¬(A1 A, N2 @A1 {d})
10 concept<CM >(N1 )( child(N1 ,G),child(N2 ,G),adjacent(N1 ,N2 ),
N1 @A{e,p},N2 @A{e,p}, (N1 @min{e,p},N2 @max{e,p})
12 _ (N1 @max{e,p},N2 @min{e,p}) 31
79. OPAL ›❯ Form Interpretation
3
OPAL-TL Example
Price range
two successive fields in the same group
at least one “price” type
TEMPLATE basic_concept<C,A> { concept<C>(N)( N@A{d,e,p} }
2 range connector in between
TEMPLATE concept_by_segment<C,A> { concept<C>(N)( N@A{e,p} }
4
TEMPLATE concept_minmax<C,CM ,A> {
6 concept<CM >(N1 )( child(N1 ,G),child(N2 ,G),adjacent(N1 ,N2 ),
N1 @A{e,d},(concept<C>(N2 ) _ N2 @A{e,d})
8 concept<CM >(N2 )( child(N1 ,G),child(N2 ,G),follows(N2 ,N1 ),
concept<C>(N1 ),N2 @range_connector{e,d},¬(A1 A, N2 @A1 {d})
10 concept<CM >(N1 )( child(N1 ,G),child(N2 ,G),adjacent(N1 ,N2 ),
N1 @A{e,p},N2 @A{e,p}, (N1 @min{e,p},N2 @max{e,p})
12 _ (N1 @max{e,p},N2 @min{e,p}) 31
80. OPAL ›❯ Form Interpretation
3
OPAL Repair
enforces structural constraints
by repairing 4 types of violations
under segmentation: missing segments
over segmentation: superfluous segments
under classification: missing types
over classification: superfluous types
32
81. OPAL ›❯ Evaluation
3
OPAL: Evaluation
Form Labeling Form Interpretation
Domain Scope
Layout Scope
Segment Scope
Field Scope
33
82. OPAL ›❯ Evaluation
4
Datasets
Domain-awareness Experiment
UK real estate and used car domain
100 web forms sampled for each domain
Domain-independence Experiment—ICQ
Web query interfaces in 5 domains
100 query interfaces (20 in each domain)
Domain-independence Experiment—Tel8
Web query interfaces in 8 domains
477 in total
34
83. OPAL ›❯ Evaluation
4
Evaluation Summary
Precision Recall F-score
1
0.985
0.97
0.955
0.94
UK Real Estate (100) UK Used Car (100) ICQ (98) Tel-8 (436)
35
85. results (c)
OPAL ›❯ Evaluation
4
AL comparison
OPAL: Performance (incl. Browser)
1
200
Total Computation time [s]
150
100 0.8
50
0
0 2000 4000 6000 8000 0.6
Number of nodes 37
86. results (c)
OPAL ›❯ Evaluation
4
AL comparison
OPAL: Performance (incl. Browser)
1
200
>70% below 50s
Total Computation time [s]
150
100 0.8
50
0
0 2000 4000 6000 8000 0.6
Number of nodes 37
87. OPAL ›❯ Conclusion
4
OPAL conclusion
multi-scope domain independent form labeling
combines visual, textual, and structural features
expands form labeling with field, segment, page scope
domain-aware form interpretation
classifies form fields and repairs domain form model
disambiguates form classification (up to 10% gain in
precision)
template language OPAL-TL
provides a library of ubiquitous form patterns
significantly speeds up domain instantiation
38
88. OPAL ›❯ Conclusion
4
Future work
Integrate form understanding & record extraction (e.g.,
AMBER)
“Site scope”
Semi-automatic ontology creation for new domains
tailor ontology learning approaches to forms
Integrate form constraint extraction
from probing and Javascript analysis (ProFound demo)
39
89. DIADEM ›❯ OPAL
OPAL: A Passe-partout for Web Forms
DIADEM domain-centric intelligent automated
data extraction methodology
Authors
Xiaonan Guo, Jochen Kranzdorf,
Digital Home
diadem.cs.ox.ac.uk/opal/
Sponsors
Tim Furche, Giovanni Grasso,
Giorgio Orsi, Christian Schallhart
diadem-opal@cs.ox.ac.uk
b-node
Segment Scope
Segmentation: Find “logical” structure of the form OPAL: Automated Form Understanding for the Deep Web OPAL: A Passe-partout for the Web
— segmentation tree with only fields as leaves and OPAL combines multi-scope domain-independent analysis with domain knowledge Automatic form filling based on domain-specific master form
• form segments s as inner nodes such that s has T4 — multi-scope domain-independent analysis: field, segment, and layout scope — automatic (approximate) matching of master form values to values of concrete fields
• at least degree two and all fields in s are style-equivalent
T2 — integration of three scopes yields more robust, simpler heuristics than single scope — visualization of form and segment concepts
4 — segmentation labeling: distribute labels F2 — strict preference for disambiguation for quality and performance reasons — automatically detects forms of the given domain and fills them
• to fields, if there is a regular structure Domain knowledge on top of the domain-independent analysis for — works on nearly any page of the domain
T1 NW N NE
2 4 5 • to segments, if single, prefix label — classifying form fields and segments according to the domain ontology
T3 nF3 n’ W F1 E— verifying and repairing the form model to be consistent with domain constraints OPAL outperforms previousapproaches even without domain knowledge
— style-equivalence: two nodes and
2 5 6 7 8 8 SE Datalog-based template language for easy definition of domain knowledge
— — with domain knowledge we achieve nearly perfect accuracy in multiple domains
1 1 3 7 • if same class or same type and CSS style SW S
1 3 6 Figure 5: Layout Scope
— complexity: linear in document size and depth Labeling
Figure 4: Example for Segment Scope Labeling
DOM tree w-nw-n-ne-e(t, f 0 ) or (2) f and f 0 are aligned and (i) w(t, f 0 ) or
Segment tree Layout tree Schema tree
ns as representative for s (f (s) = ns ). For each segment with reg- (ii) nw-n(t, f 0 ) and there is a text node t 0 not overshadowed by an-
ular interleaving of text nodes and field or segment nodes, we use other field with ne-e(t 0 , f 0 ) and w-nw-n(t 0 , f ).
those text nodes as labels for these nodes, preserving any already
Thank You
assigned labels and fields (from field scope). In detail, we iterate Fields & Labels and T are overshadowedthe example in Fig-, & Labels
To illustrate this overshadowing, consider
ure 5. For field F , T
Segments
by F and T by F
Visual Labels Form Model
over all descendants c of each segment in document order, skip- 1 2 4 2 3 3
ping any nodes that are descendants of another segment or field only T1 is not overshadowed, as there is no other text node that is
itself contained in n (line 13). In the iteration, we collect all field or south-east or south from T3 not overshadowed by another field.
segment nodes in Nodes, and all sets of text nodes between field or The layout scope labeling is then produced as follows: For each
Field Scope
segment nodes in Labels, except those text nodes already assigned Segment Scope
field f , we collect all text nodes t with w-nw-n(t, f ) and add them Layout Scope Domain Scope
as labels in field scope (line 14), as we assume that these are outliers as labels to f if they are not overshadowed by another field and not
in the regular structure of the segment. We assign the i-th text node contained in a segment that is no ancestor of f . The latter prevents
group to the i-th field, if the two lists have the same size (possibly assignment of labels from unrelated form segments.
using the first text node as labels of the segment, line 17–19).
Figure 4 illustrates the segment scope labeling with triangles 4. FORM INTERPRETATION
denoting Visual heuristics: Find labels inblack circles segments, and
text nodes, diamonds fields, visual proximity of a field
There is no straightforward relationship between form fields for
T4
white circles DOM “overshadowed” segment tree. The numbers in-
— but: not nodes not in the by another field
T2
domain concepts, such as location or price, and their structure within
dicate which text nodes are assigned as labels tofield in reading order
— but: preference for labels preceding a which segments or F2
a form. Even seemingly domain-independent concepts, such as
fields. E.g., for the left hand segment, we observe a regular struc-
price,Toften exhibit domain NE specific peculiarities, such as “guide
ture of (text node+, field)+ and thuspwe assign the i-th group of
— t visible from a point if north-west of p 1 NW N
price”, “current offers in excess”, or payment periods in real es-
text nodes to the i-th field. For the for t if hand segment (4), we find T3
— f overshadows f’ right t visible from F3 W F1 E
tate. OPAL’s domain schemata allow us to cover these specifics.
a subsegment (5) • bottom-right corner of f and f, f’ unaligned node
and field 8 that is already labeled with text
We recall from Section 2 that a form model (F 0 , t) for a schema S
SW S SE
8 in the field scope. •Thus 8 is ignored and fonly f, f’ aligned
bottom-left corner of and one text node re-
is derived from a form labeling F by extending F with types and
restructuring its inner nodes to fit the structuralFilling: Aof S.
Form constraints Passe-partout for the Web Experiments Applications of OPAL
mains directly in 4, which becomes the segmentof f, f, f'In 5, weand f
• bottom-right corner label. aligned find
one more text node group than fields and visualconsider the first text
has no other thus label
form filling
OPAL performs form interpretation of a form labeling Fas template for filling individual forms (passe-partout)
Master form serves in two Search
node group as a segment label. The remaining nodes have a regular 1
steps: (1) the classification of nodes in — according free text values used directly for free text inputs
F currently: to the domain —‘key’ to the web Data Extraction
structure (field, text node+)+ and get assigned accordingly. 0.985
Layout Scope types T to obtain a partial typing tP . Thisbut using approximate matching for filling select boxes
— step relies on the anno- — all types of web
3.3 Layout Scope tation schema L and its typing of labels in F; (2) the model repair 0.97 automation need it
where the segmentation structure derived in the segmentation scope
At layout scope, we further refine the form labeling for each ➊ URL
(Section 3.2) is aligned with the structure constraints of S.
➊ 0.955
form field not yet labelled in field or segment scope, by exploring (a)0.94
the visible text nodes in the west, north-west, or north quadrant,
➋ Visualization ➋ UK Real Estate (100) UK Used Car (100) ICQ (98) Tel-8 (436)
Assistive
Domain Scope: OPAL-TL by any other field. To this end, OPAL 4.1 Schema Design: OPAL-TL
Controller (b) Devices
if they are not overshadowed 1 TEMPLATE basic_concept<C,A> { concept<C>(N)( N@A{d,e,p} }
OPAL provides a template language, OPAL - TL, for easily speci- 1
OPAL-TL extendsaDatalog¬,≠ with templates and annotation the DOM nodes:
constructs layout tree from the CSS box labels of queries
domain schemata reusing common Web page and their con- segment<C,A> { ➌
(c)
2
fying 2 A 4 A ➌ concepts TEMPLATE concept_by_ concept<C>(N)( N@A{e,p} } 0.98
— templates = common9. The layout forms, e.g.,given DOM P is a tuple
D EFINITION pattern in web tree of a range specification
straints as well as concept templates. To implement aconcept_minmax<C,C ,A> {
A
4
TEMPLATE
new do- (d)
0.96 (e)
— rules = , , w, nw, nfor domain-specific element andN is thetypes DOM
(NP constraints , ne, e, se, s, sw, aligned) where segment set of B 3 A M
P main, we only need to provide (1) a set of annotators implementing
6 concept<CM >(N1 ) ( child(N1 ,G),child(N2 ,G),adjacent(N1 ,N2 ), 0.94
(f)
nodes from P, , w, nw, n, . . . the “belongs to” (containment), west,
TEMPLATE segment<C>{ CisLabela and isValuea and (2) an OPAL ➍ Visualization of the do-
B - TL specification N1 @A{e,d},(concept<C>(N2 ) _ N2 @A{e,d})
➍ 0.92
Ⓐ
8 concept<CM >(N (
2
north-west, north, . . . relations from RCR [12], and aligned(x, y) interpretations. The classification yields a and structural constraints. 21)),Nchild(N1 ,G),child(N2 ,G),follows(N2N2 @A1 {d})
segment<C>(G)( outlier<C>(G),child(N1 ,G),¬ child(N2 ,G),
Repairing form main types and their classification
Figure 6: Example Form Labeling
,N1 ),
¬(concept<C>(N2 ) _ segment<C>(N2 )) } Ⓐ Field concept<C>(N 2 @range_connector{e,d},¬(A1 A, 0.9 (a) web page (b) page scope
holds if x and y have the same height and are horizontally aligned.— annotation queries:is, however,TL extends Datalogun-
form interpretation F, that labels annotated with certain type templates and predefined predi-
OPAL - not necessarily a model with 10 concept<CM >(N1 )
Ⓑ
( child(N1 ,G),child(N2 ,G),adjacent(N1 ,N2 ), Airfare Auto Book Job US R.E.
and may contain direct only or also group labels? derived from ex- Segment@A{e,p},N2 @A{e,p}, (N1 @min{e,p},N2 @max{e,p})
Ⓑ
4
der S,direct?either provided by human domain experts or We
are look for violations ofconvenient querying of annotations and DOM nodes. An Multi-modal Input
TEMPLATE segment_range<C,CM > { • cates for structural constraints. N1
segment<C>(G)( outlier<C>(G),concept<CM >(N1 ),concept<CM >(N2 ), layout the ternalat fields and segments and the segment hierarchy current OPAL automatically (N @max{e,p},N @min{e,p})
We call w, nw, . . . the neighbour relations. The adapt tree is of types sources such as DBPedia and Freebase. The
Figure 4: Holbrook Moran form and page-scope labeling
(Touch-Screens)
6 filled 12 _ 1 2 field segment layout domain
• proper? look for labels (‘price’) or also values (‘£10’)
most quadratic in size of a given DOM P and canof•F with version contains a of labels below to artefacts for common domain labeling F and a DOM
be computed majority describedprogramconstruct a form
the rewriting rules large set of such is executed against a form
OPAL - TL
N1 6= N2 ,child(N1 ,G),child(N2 ,G) }
according to master
8
modelexclusive? as price,Relationstheof the type P strati-
compliant with S. OPALlocation,are date. F and a are mapped in the obvious way to OPAL - OPAL - TL classification templates
P. performsorfrom rewriting in
TEMPLATE O(|P|2 ). For convenience, we write, e.g., w-nw-n to denote the
in segment_with_unique<C,U> { types such form Figure 7: 1 Real-estate 1 Web Automation
fied manner to guarantee termination only use child (descendant, resp.) for the child (descen-
and introduces at most n new
D EFINITION TL. We ain the form.
segment<C>(G)( outlier<C>(G),child(N1 ,G), concept<U>(N1 ,G), — template: limited second-order, same data complexity & Testing
10
union of the relations w, nw, and n. segments where n is the number 11. of fields form labeling F on a DOM P and an
Given
¬ child(N2 ,G),N1 6= N2 ,¬(concept<C>(N2 )_segment<C>(N2 )) . } ➎ Master formsibling order 0.99
In cultures with left-to-right reading direction, we observe a strong schema there isOPAL - TL annotationtqueryextend document and an example, the following template defines a family of con-
• template variablesIfquantify over predicates or We is an expres-
(1) Under Segmentation: L, resp.) relation intype such
annotation dant, an a segment n with F. types As Precision domain
12
provides passe- 0.98
that Cinstantiation reduceschildto F: follows(X,Yfirst-order variable, Rfollowing (f (X), f (Y )) 2 the domain type D to a node N whenever N
straints that associate Used-car
TEMPLATE outlier<C>{ • T (t) requires additional to Datalog where X is ,a . .) for X,Y 2 F, if
from P segments
sion of the form: X@A{d, p, e} of type t1 . ,tk 62 Recall layout
preference for placing labels in the w-nw-n region)) } child-T (n), we try to partition the children of n into k + 1 partitions
0 ,P),¬(segment<C>(G0 from a field. How-
0.8
A 2 A, and d, P and e are annotation modifiers. An annotation for filling
partout is labeled by an exclusive direct and proper annotation of type A. F-score segment
p,
ever, forms often have many fields interspersed with field. .labels and that Piwithandi )no otherholds. .for } F X T (t). between X and Y in document or-
14 outlier<C>(G)( root(G)_child(G,P),child(G INSTANTIATE basic_concept<D,A> using node in {<RADIUS, occurs
radius>} 0.97
0.6 0.7 0.8 0.9 1 field
P1 , . , Pk ,query X@Aµ |= CT (t {d, p, n [ {t1 , . ,tk all C 2 J Aµ K with individual forms
Pn such µ ✓ and Pe} (X,Y ), |=
segment labels. Thusstructural constraints
Figure 8: OPAL - TL we have to carefully consider overshadowing. a new segment node as
For each Pi we add
der; adjacentchild of n,ifclassify it (f (X), f (Y )) 2 P or vice versa.
Rnext-sibling TEMPLATE basic_concept<D,A> { concept<D>(N) ( N@A{e,d,l} } 0.96
Figure 3: OPAL Interface
@Aµ nodes 2 P : Allowµ (n) Matchµ labell 0} (X)) µ (A)
Intuitively, for a field f , a visible text node t is overshadowedJby all K = {nassigned wePiabbreviate (A) 6= (f Blockas l(X).
Finally, to from n to that segment. / 0.95 0.6
with ti , and move
In practice, few cases of multiple under segmentations occur at the A template tpl is instantiated to produce a family of rules where Real estate
Real-estate Used car
Used-car Real-estate Used-car
first two templates). It is0 the only template 0with two concept tem- 0
Notas del editor
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
= ontology-based web form pattern analysis with logic \n