DIADEM: domain-centric intelligent automated data extraction methodology

OPAL: Automated Form Understanding for the Deep Web

Xiaonan Guo
DIADEM Group, Department of Computer Science, Oxford University
Joint work with Tim Furche, Giovanni Grasso, Jochen Kranzdorf, Giorgio Orsi, Christian Schallhart
OPAL ›❯ Scenario
1


    A scenario...
      Looking for a house ?


      Too many websites to check ?


      Tired of searching through dozens of websites?




                                                      2
OPAL ›❯ Scenario
1


    The Diversity of Forms




                             3
OPAL ›❯ Scenario
1


    Automatic Form Understanding
        Form Understanding := Form Labeling + Form Interpretation

    Form Labeling                              Form Interpretation
                          Layout Scope            Domain Scope
                Segment Scope
       Field Scope




                                                                 4
OPAL ›❯ Scenario
1


    Previous Approaches
    Form Labeling
                                                 Dragut et al., VLDB’09
                           Layout Scope
                 Segment Scope
                                                 Khare et al., CIKM’09
        Field Scope

                                                 Nguyen et al., VLDB’08




         Only a subset of scopes for form labeling
         Only form labeling, little or no form interpretation
         Where domain-specific, expensive to switch domains           5
OPAL ›❯ Scenario
1


    OPAL
      OPAL = multi-scope form labeling + flexible form
      interpretation
      1. form labeling with all three scopes
            more accurate, more robust, less complex

      2. form interpretation = matching and repair for elements to
         concepts
      3. template language and pattern library for defining form
         ontologies
            more succinct, more reuse, less work




                                                                     6
    [Chart overlay on this slide: evaluation summary per dataset, UK Real Estate (100), UK Used Car (100), ICQ (98), Tel-8 (436); y-axis from 0.94 to 1; annotated “>95%”]
OPAL ›❯ Form Labeling
2


    Form Labeling

Form Labeling                          Form Interpretation
                                           Domain Scope
                        Layout Scope
              Segment Scope
     Field Scope




                                                          7
OPAL ›❯ Form Labeling
2


Field Scope
     simple heuristic applied to individual fields
     assigns a text node t to a field f if
        the least common ancestor of t and f is an ancestor of no other
        field
        linear coloring algorithm
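
The field-scope rule can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Node` class, the toy DOM, and the function names are hypothetical, and the pairwise check below is quadratic rather than the linear coloring algorithm named on the slide.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False, repr=False)
class Node:
    """Minimal stand-in for a DOM node (hypothetical, for illustration only)."""
    name: str
    kind: str = "element"  # "element", "field", or "text"
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def ancestors(node):
    """Return the node followed by all of its ancestors, nearest first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def least_common_ancestor(a, b):
    seen = {id(n) for n in ancestors(a)}
    for n in ancestors(b):
        if id(n) in seen:
            return n
    return None


def fields_below(node):
    """Collect all form fields in the subtree rooted at node."""
    found = [node] if node.kind == "field" else []
    for child in node.children:
        found.extend(fields_below(child))
    return found


def field_scope_labels(fields, texts):
    labels = {f.name: [] for f in fields}
    for t in texts:
        for f in fields:
            lca = least_common_ancestor(t, f)
            # t labels f only if no other field sits below their LCA
            if lca is not None and fields_below(lca) == [f]:
                labels[f.name].append(t.name)
    return labels


# Toy DOM: two rows with one text and one field each, plus a form-level title.
form = Node("form")
row1 = form.add(Node("div1"))
row2 = form.add(Node("div2"))
t_price = row1.add(Node("Price", "text"))
f_price = row1.add(Node("price", "field"))
t_town = row2.add(Node("Town", "text"))
f_town = row2.add(Node("town", "field"))
t_title = form.add(Node("Search homes", "text"))

result = field_scope_labels([f_price, f_town], [t_price, t_town, t_title])
```

In this toy form, "Search homes" stays unassigned because its least common ancestor with either field is the form itself, which contains both fields.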




                                                                       8
    [Chart overlay on the Field Scope slide: contribution of the field, segment, layout, and domain scopes for the Real-estate and Used-car domains; x-axis from 0.6 to 1]
OPAL ›❯ Form Labeling
2


    Segment Scope

Form Labeling


              Segment Scope
     Field Scope




                              9
OPAL ›❯ Form Labeling
2


    Segment Scope
      builds a segment tree
         groups form elements that are likely semantically related, e.g. fields
         and labels
         assigns labels to fields and groups
      form fields are grouped if
         they occur in sequence
         they have similar style
         their least common ancestor contains
         no other elements
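
The grouping criteria above can be approximated as follows. This is a simplified sketch, not the segment-tree construction itself: the `Element` record, the `style` key, and the use of a shared `container` as a stand-in for "their least common ancestor contains no other elements" are all assumptions made for illustration.

```python
from itertools import groupby
from typing import NamedTuple


class Element(NamedTuple):
    name: str
    kind: str       # "field" or "text"
    style: str      # hypothetical style key, e.g. a CSS class
    container: str  # id of the enclosing DOM element (assumed)


def segment_fields(elements):
    """Group maximal runs of consecutive, style-equivalent fields per container."""
    segments = []

    def key(e):
        return (e.kind, e.style, e.container)

    for (kind, _style, _container), run in groupby(elements, key=key):
        names = [e.name for e in run]
        # A single field is not a group; texts never form a field segment.
        if kind == "field" and len(names) > 1:
            segments.append(names)
    return segments


# Elements in document order (hypothetical real-estate form).
elements = [
    Element("Location", "text", "label", "row1"),
    Element("town", "field", "textbox", "row1"),
    Element("min_price", "field", "select", "row2"),
    Element("max_price", "field", "select", "row2"),
    Element("min_beds", "field", "select", "row3"),
    Element("max_beds", "field", "select", "row3"),
]
segments = segment_fields(elements)
```

Note that `groupby` only merges adjacent items, which is what models the "occur in sequence" criterion here.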



                                                                         10
OPAL ›❯ Form Labeling
2


    Segment Tree




    [Figure 3: Example DOM Tree and Segment Tree]
             remove layout or template artifacts that
             artificially break sequences of style-equivalent fields
        Algorithm 3: SegmentScopeLabeling(DOM P, Form Labeling F)
    1    S ← SegmentTree(P);
                                                                         11

    [Figure: tree with numbered nodes 1 to 7]
                                                                         12
OPAL ›❯ Form Labeling
2


    Field & Segment Scope: Example




                                     13
OPAL ›❯ Form Labeling
2


    Layout Scope

Form Labeling
                        Layout Scope
              Segment Scope
     Field Scope




                                       14
OPAL ›❯ Form Labeling
2


    Layout Scope
      Aligns visually related texts and fields
         prefers texts to the west, north-west, or north of a field (W, NW, N)
         considers overshadowing
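
A rough sketch of the direction preference, assuming axis-aligned bounding boxes with y growing downwards. The `Box` type, the ordering W before NW before N, and the distance tie-break are assumptions; overshadowing is not modeled here.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    name: str
    left: float
    top: float
    right: float
    bottom: float


# Assumed preference order among the three preferred directions.
PREFERENCE = ["W", "NW", "N"]


def direction(text: Box, fld: Box) -> Optional[str]:
    """Compass direction of `text` as seen from `fld` (y grows downwards)."""
    if text.right <= fld.left:
        horiz = "W"
    elif text.left >= fld.right:
        horiz = "E"
    else:
        horiz = ""
    if text.bottom <= fld.top:
        vert = "N"
    elif text.top >= fld.bottom:
        vert = "S"
    else:
        vert = ""
    return (vert + horiz) or None  # None when the boxes overlap


def layout_label(fld, texts):
    """Pick the text in the most preferred direction, nearest first."""
    candidates = []
    for t in texts:
        d = direction(t, fld)
        if d in PREFERENCE:
            dx = max(fld.left - t.right, 0)
            dy = max(fld.top - t.bottom, 0)
            candidates.append((PREFERENCE.index(d), dx + dy, t.name))
    return min(candidates)[2] if candidates else None


f1 = Box("F1", 100, 100, 200, 120)
texts = [
    Box("Price", 20, 100, 90, 120),       # west of F1
    Box("Search", 100, 60, 200, 80),      # north of F1
    Box("per month", 210, 100, 280, 120)  # east of F1, never preferred
]
chosen = layout_label(f1, texts)
```

Without an overshadowing check, a distant north-west text could win over a nearby one that belongs to another field, so this sketch should be read as the preference step only.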


    [Figure: example layout with texts T1 to T4 and fields F1 to F3, and a compass grid (NW, N, NE, W, E, SW, S, SE) centered on field F1]
                                                           15
OPAL ›❯ Form Labeling
2


    Layout Scope Example




                             16
OPAL ›❯ Form Labeling
3


    Form Labeling

Form Labeling
                        Layout Scope
              Segment Scope
     Field Scope




                                       17
OPAL ›❯ Form Labeling
2


    Evaluation: ICQ & Tel8 Benchmark
      Domain-independence Experiment—ICQ
         Web query interfaces in 5 domains
         100 query interfaces (20 in each domain)
      Domain-independence Experiment—Tel8
         Web query interfaces in 8 domains
         477 in total




                                                    18
OPAL ›❯ Form Labeling
2


    ICQ by Domain


    [Bar chart: ICQ results by domain (Airfare, Auto, Book, Job, US R.E.); y-axis from 0.9 to 1; annotated “>98%”, with a marker labeled “Dragut et al., VLDB’09”]



                                                                 19
OPAL ›❯ Form Labeling
2


    Tel-8 by Domain



    [Bar chart: Tel-8 results by domain (Airfares, Automobiles, Books, Car Rentals, Hotels, Jobs, Movies, Records); y-axis from 0.9 to 0.99; annotated “>95%”]


                                                                                       20
OPAL ›❯ Form Interpretation
3


    Form Interpretation

Form Labeling                          Form Interpretation
                                           Domain Scope
                        Layout Scope
              Segment Scope
     Field Scope




                                                          21
OPAL ›❯ Form Interpretation
3


Form Interpretation


    Real-Estate Form
        Geographic Segment
            Area-Branch Segment
                AreaBranch Element
                AreaBranch Element
                AreaBranch Element
                AreaBranch Element
                AreaBranch Element
                ...
        Price Segment
            Price Element(min)
            Price Element(max)
        ...




                                                                                 22
OPAL ›❯ Form Interpretation
3


    Form Interpretation
      Form interpretation in OPAL =
         annotation + classification + repair


      Annotation: straightforward entity recognition
      Classification: maps fields to concepts based on
      annotations
      Repair: enforces structural constraints


      Together: high precision disambiguation for form
      classification
                                                         23
OPAL ›❯ Form Interpretation
3


    OPAL Ontology
      What does an OPAL ontology consist of?

     Annotation     Annotation types     “number”, “price period” (p.c.m.)


    Classification       Concepts           “price field”, “location field”


       Repair          Constraints     “R.E. form must have a location field”


       constraints:
          classification: between annotations and concepts
          structural: between concepts
                                                                             24
OPAL ›❯ Form Interpretation
3


    OPAL Ontology
      Reducing the effort in designing an OPAL ontology
         by exploiting ubiquitous patterns of web forms
            e.g., “range” over a numeric type, dependent types
         Halevy et al., VLDB’08 identify 7 such patterns
         but: these patterns must be instantiated for a specific
         domain
            e.g., “price range”, “radius” linked with “location”,
            “postcode” with “city”, …


      OPAL’s solution: templates for these patterns

                                                                    25
OPAL ›❯ Form Interpretation
3


    Form Patterns Example




     Small set of ubiquitous patterns
       ranges, dates, options, etc.
     Ontology by instantiation

                                        26
OPAL ›❯ Form Interpretation
3


OPAL-TL
     Patterns are specified using OPAL-TL


     OPAL-TL = Datalog + Annotation Queries + Templates




                                                          27
OPAL ›❯ Form Interpretation
3


    Annotation Queries
      Annotation types come in two varieties
         “label” types match label text (e.g., “price”); “value” types match field values (e.g., “€120”)
      Disambiguation through a fixed precedence on annotation
      types
         e.g., a field with the value “lowest price first”
            carries both a “price” and an “order-by” annotation, but “order-by” <
            “price”




                                                                       28
OPAL ›❯ Form Interpretation
3


    Annotation Queries
    [Figure 6: Example Form Labeling]
      N@A{ d, p, e }: “all nodes N having labels annotated with A”
         d : direct (labels associated with the node itself)
         p : proper (from labels, but not from field values)
         e : exclusive (labels considering precedence)
                                                                   29
OPAL ›❯ Form Interpretation
3


    Annotation Queries
    [Figure 6: Example Form Labeling; nodes 1 to 4 carry labels annotated with A, B, and C, with callouts for a segment, a field, a value label, and a proper label]
      assuming B < A
      N@A{ } = 2, 3, 4
      N@A{ d, e } = 2, 4
      N@A{ p, e } = 2, 3
                                                                   30
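
The query modifiers can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the figure is not recoverable from the slide text, so the `LABELS` table is a hypothetical labeling chosen to reproduce the three results above, and "B < A" is read as "B takes precedence over A".

```python
from typing import NamedTuple


class Label(NamedTuple):
    annotation: str  # annotation type, e.g. "A" or "B"
    direct: bool     # attached to the node itself
    proper: bool     # True for label text, False for a field value


# Lower rank means higher precedence, so B < A reads "B takes precedence over A".
PRECEDENCE = {"B": 0, "A": 1}

# Hypothetical labeling, not taken from the figure.
LABELS = {
    2: [Label("A", direct=True, proper=True)],
    3: [Label("B", direct=True, proper=False),
        Label("A", direct=True, proper=True)],
    4: [Label("A", direct=True, proper=False),
        Label("A", direct=True, proper=False)],
}


def query(annotation, modifiers=""):
    """Return nodes with a label annotated `annotation`, filtered by d/p/e."""
    result = []
    for node, labels in sorted(LABELS.items()):
        considered = [
            lab for lab in labels
            if ("d" not in modifiers or lab.direct)
            and ("p" not in modifiers or lab.proper)
        ]
        if not any(lab.annotation == annotation for lab in considered):
            continue
        # Exclusive: drop the node if a considered label has higher precedence.
        if "e" in modifiers and any(
            PRECEDENCE[lab.annotation] < PRECEDENCE[annotation]
            for lab in considered
        ):
            continue
        result.append(node)
    return result


all_nodes = query("A")
direct_exclusive = query("A", "de")
proper_exclusive = query("A", "pe")
```

Under this labeling, node 3 drops out of the direct, exclusive query because its value label B outranks A, and node 4 drops out of the proper query because its A labels come from field values.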
OPAL ›❯ Form Interpretation
3


     OPAL-TL Example
       Price range
          two successive fields in the same group
          at least one “price” type
          range connector in between

     TEMPLATE basic_concept<C,A> { concept<C>(N) ⇐ N@A{d,e,p} }

     TEMPLATE concept_by_segment<C,A> { concept<C>(N) ⇐ N@A{e,p} }

     TEMPLATE concept_minmax<C,CM,A> {
       concept<CM>(N1) ⇐ child(N1,G), child(N2,G), adjacent(N1,N2),
         N1@A{e,d}, (concept<C>(N2) ∨ N2@A{e,d})
       concept<CM>(N2) ⇐ child(N1,G), child(N2,G), follows(N2,N1),
         concept<C>(N1), N2@range_connector{e,d}, ¬(A1 ≺ A, N2@A1{d})
       concept<CM>(N1) ⇐ child(N1,G), child(N2,G), adjacent(N1,N2),
         N1@A{e,p}, N2@A{e,p}, (N1@min{e,p}, N2@max{e,p})
           ∨ (N1@max{e,p}, N2@min{e,p})
                                                                        31
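
The three conditions of the price-range pattern can be mimicked outside OPAL-TL as a plain Python check. This is a loose sketch, not a translation of the templates: the `Item` record is hypothetical, and the range connector is modeled as a separate text item between the two fields, which is an assumption.

```python
from typing import NamedTuple


class Item(NamedTuple):
    name: str
    kind: str               # "field" or "text"
    annotations: frozenset  # annotation types found on the item


def find_price_ranges(group):
    """Return (first_field, second_field) pairs matching the price-range pattern."""
    ranges = []
    # Slide a window of three over the group: field, connector, field.
    for first, middle, second in zip(group, group[1:], group[2:]):
        if first.kind != "field" or second.kind != "field":
            continue
        if "range_connector" not in middle.annotations:
            continue
        # At least one of the two fields must carry a "price" annotation.
        if "price" in first.annotations or "price" in second.annotations:
            ranges.append((first.name, second.name))
    return ranges


price_group = [
    Item("price_from", "field", frozenset({"price"})),
    Item("to", "text", frozenset({"range_connector"})),
    Item("price_to", "field", frozenset()),
]
other_group = [
    Item("price", "field", frozenset({"price"})),
    Item("Bedrooms", "text", frozenset()),
    Item("beds", "field", frozenset()),
]
price_ranges = find_price_ranges(price_group)
no_ranges = find_price_ranges(other_group)
```

Only one of the two fields needs a "price" annotation, which is how the second field of a range can be classified even when its own label says nothing about price.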
OPAL ›❯ Form Interpretation
3


    OPAL Repair
      enforces structural constraints
         by repairing 4 types of violations


      under segmentation: missing segments
      over segmentation: superfluous segments
      under classification: missing types
      over classification: superfluous types
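
To make the four violation types concrete, here is a hedged sketch that only detects them and does not repair anything. The slide names the violation types but not how they are detected, so the ontology fragment (`REQUIRED_SEGMENTS`, `REQUIRED_CONCEPTS`) and every detection rule below are assumptions.

```python
# Hypothetical ontology fragment: each segment type lists the concepts it needs.
REQUIRED_SEGMENTS = {"price_range": {"min_price", "max_price"}}
# Hypothetical structural constraint, e.g. a form must have a location field.
REQUIRED_CONCEPTS = {"location"}


def find_violations(segments, field_types):
    """segments: {segment_type: concepts}; field_types: {field: concepts}."""
    violations = []
    all_concepts = set().union(*field_types.values())
    for seg, needed in sorted(REQUIRED_SEGMENTS.items()):
        # The parts are present but the segment grouping them is missing.
        if seg not in segments and needed <= all_concepts:
            violations.append(("under segmentation", seg))
    for seg in sorted(segments):
        # A segment the ontology does not know is treated as superfluous.
        if seg not in REQUIRED_SEGMENTS:
            violations.append(("over segmentation", seg))
    for concept in sorted(REQUIRED_CONCEPTS - all_concepts):
        violations.append(("under classification", concept))
    for fld, types in sorted(field_types.items()):
        # More than one concept on a field counts as superfluous types.
        if len(types) > 1:
            violations.append(("over classification", fld))
    return violations


observed_segments = {"misc": {"min_price"}}
observed_fields = {
    "f1": {"min_price"},
    "f2": {"max_price"},
    "f3": {"order_by", "min_price"},
}
violations = find_violations(observed_segments, observed_fields)
```

A repair step would then add or remove segments and types until no violation remains; how OPAL chooses among possible repairs is not stated on the slide.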




                                               32
OPAL ›❯ Evaluation
3


    OPAL: Evaluation

Form Labeling                          Form Interpretation
                                           Domain Scope
                        Layout Scope
              Segment Scope
     Field Scope




                                                          33
OPAL ›❯ Evaluation
4


    Datasets
      Domain-awareness Experiment
         UK real estate and used car domain
         100 web forms sampled for each domain
      Domain-independence Experiment—ICQ
         Web query interfaces in 5 domains
         100 query interfaces (20 in each domain)
      Domain-independence Experiment—Tel8
         Web query interfaces in 8 domains
         477 in total


                                                    34
OPAL ›❯ Evaluation
4


 Evaluation Summary

    [Bar chart: precision, recall, and F-score per dataset: UK Real Estate (100), UK Used Car (100), ICQ (98), Tel-8 (436); y-axis from 0.94 to 1]
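
For reference, the three reported measures relate as in the sketch below. It assumes the balanced F-score (F1), which the slides do not state explicitly, and the counts are invented for illustration; they are not results from the evaluation.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and the balanced F-score from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 96 correct field labels, 4 wrong, 8 missed.
precision, recall, f1 = precision_recall_f1(
    true_positives=96, false_positives=4, false_negatives=8
)
```

The function does not guard against empty counts, so it raises `ZeroDivisionError` when there are no true positives at all.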


                                                                                   35
OPAL ›❯ Evaluation
4


    Scope Contribution

    [Bar chart: contribution of the field, segment, layout, and domain scopes for the Real-estate and Used-car domains; x-axis from 0.6 to 1]

                                                                        36
OPAL ›❯ Evaluation
4


    OPAL: Performance (incl. Browser)
      [Plot: Total Computation time [s] (0 to 200) against Number of nodes (0 to 8000)]
      >70% below 50s

                                                                                      37
OPAL ›❯ Conclusion
4


    OPAL conclusion
      multi-scope, domain-independent form labeling
         combines visual, textual, and structural features
         expands form labeling with field, segment, page scope
      domain-aware form interpretation
         classifies form fields and repairs domain form model
         disambiguates form classification (up to 10% gain in
         precision)
      template language OPAL-TL
         provides a library of ubiquitous form patterns
         significantly speeds up domain instantiation

                                                                38
OPAL ›❯ Conclusion
4


Future work
     Integrate form understanding & record extraction (e.g.,
     AMBER)
        “Site scope”


     Semi-automatic ontology creation for new domains
        tailor ontology learning approaches to forms


     Integrate form constraint extraction
        from probing and JavaScript analysis (ProFound demo)



                                                               39
DIADEM ›❯ OPAL

    OPAL: A Passe-partout for Web Forms
    DIADEM: domain-centric intelligent automated data extraction methodology
      Authors: Xiaonan Guo, Jochen Kranzdorf, Tim Furche, Giovanni Grasso, Giorgio Orsi, Christian Schallhart
      Digital Home: diadem.cs.ox.ac.uk/opal/
      diadem-opal@cs.ox.ac.uk

    OPAL: Automated Form Understanding for the Deep Web
      OPAL combines multi-scope domain-independent analysis with domain knowledge
         multi-scope domain-independent analysis: field, segment, and layout scope
         integration of three scopes yields more robust, simpler heuristics than single scope
         strict preference for disambiguation for quality and performance reasons
      Domain knowledge on top of the domain-independent analysis for
         classifying form fields and segments according to the domain ontology
         verifying and repairing the form model to be consistent with domain constraints
      Datalog-based template language for easy definition of domain knowledge

    OPAL: A Passe-partout for the Web
      Automatic form filling based on domain-specific master form
         automatic (approximate) matching of master form values to values of concrete fields
         visualization of form and segment concepts
         automatically detects forms of the given domain and fills them
         works on nearly any page of the domain
      OPAL outperforms previous approaches even without domain knowledge
         with domain knowledge we achieve nearly perfect accuracy in multiple domains

    Segment Scope
      Segmentation: Find “logical” structure of the form
         segmentation tree with only fields as leaves and form segments s as inner nodes,
         such that s has at least degree two and all fields in s are style-equivalent
         segmentation labeling: distribute labels
            to fields, if there is a regular structure
            to segments, if single, prefix label
         style-equivalence: two nodes n and n’, if same class or same type and CSS style
         complexity: linear in document size and depth

    [Figure 5: Layout Scope Labeling (fields F1 to F3, text nodes T1 to T4, quadrants NW, N, NE, W, E, SW, S, SE)]
    [Figure 4: Example for Segment Scope Labeling]
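
The segment scope rule for distributing labels can be sketched as follows. This is a simplified illustration and not the authors' algorithm: it assumes the text nodes of one segment have already been grouped in document order, and it ignores nested segments and fields labeled in field scope. The function name `distribute_labels` and its arguments are hypothetical.

```python
def distribute_labels(label_groups, fields):
    """Assign the i-th group of text nodes to the i-th field of a segment.

    label_groups: groups of text nodes, in document order.
    fields: the fields of the segment, in document order.
    Returns (segment_label, {field: labels}), or None when the segment
    shows no regular structure.
    """
    groups = [list(group) for group in label_groups]
    segment_label = None
    # One surplus group in front is read as a label for the whole segment.
    if len(groups) == len(fields) + 1:
        segment_label, groups = groups[0], groups[1:]
    if len(groups) != len(fields):
        return None  # irregular: leave the fields to a later scope
    return segment_label, dict(zip(fields, groups))
```

With groups `[["Price"], ["min"], ["max"]]` and two fields, the first group becomes the segment label and the remaining two are assigned to the fields in order; fields left unlabeled here would be handled in layout scope.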
Thank You

    [Pipeline diagram. Trees: DOM tree, Segment tree, Layout tree, Schema tree.
     Outputs: Fields & Labels, Segments & Labels, Visual Labels, Form Model.
     Scopes: Field Scope, Segment Scope, Layout Scope, Domain Scope]

    […] n_s as representative for s (f(s) = n_s). For each segment with regular interleaving of text nodes and field or segment nodes, we use those text nodes as labels for these nodes, preserving any already assigned labels and fields (from field scope). In detail, we iterate over all descendants c of each segment in document order, skipping any nodes that are descendants of another segment or field itself contained in n (line 13). In the iteration, we collect all field or segment nodes in Nodes, and all sets of text nodes between field or segment nodes in Labels, except those text nodes already assigned as labels in field scope (line 14), as we assume that these are outliers in the regular structure of the segment. We assign the i-th text node group to the i-th field, if the two lists have the same size (possibly using the first text node as labels of the segment, line 17–19).

    Figure 4 illustrates the segment scope labeling with triangles denoting text nodes, diamonds fields, black circles segments, and white circles DOM nodes not in the segment tree. The numbers indicate which text nodes are assigned as labels to which segments or fields. E.g., for the left hand segment, we observe a regular structure of (text node+, field)+ and thus we assign the i-th group of text nodes to the i-th field. For the right hand segment (4), we find a subsegment (5) and field 8 that is already labeled with text node 8 in the field scope. Thus 8 is ignored and only one text node remains directly in 4, which becomes the segment label. In 5, we find one more text node group than fields and thus consider the first text node group as a segment label. The remaining nodes have a regular structure (field, text node+)+ and get assigned accordingly.

    3.3 Layout Scope

    At layout scope, we further refine the form labeling for each form field not yet labelled in field or segment scope, by exploring the visible text nodes in the west, north-west, or north quadrant, if they are not overshadowed by any other field. To this end, OPAL constructs a layout tree from the CSS box labels of the DOM nodes:

    DEFINITION 9. The layout tree of a given DOM P is a tuple (N_P, […], w, nw, n, ne, e, se, s, sw, aligned) where N_P is the set of DOM nodes from P, […], w, nw, n, ... the “belongs to” (containment), west, north-west, north, ... relations from RCR [12], and aligned(x, y) holds if x and y have the same height and are horizontally aligned.

    We call w, nw, ... the neighbour relations. The layout tree is at most quadratic in size of a given DOM P and can be computed in O(|P|^2). For convenience, we write, e.g., w-nw-n to denote the […]

    […] w-nw-n-ne-e(t, f’) or (2) f and f’ are aligned and (i) w(t, f’) or (ii) nw-n(t, f’) and there is a text node t’ not overshadowed by another field with ne-e(t’, f’) and w-nw-n(t’, f).

    To illustrate this overshadowing, consider the example in Figure 5. For field F1, T2 and T4 are overshadowed by F2 and T3 by F3; only T1 is not overshadowed, as there is no other text node that is south-east or south from T3 not overshadowed by another field.

    The layout scope labeling is then produced as follows: For each field f, we collect all text nodes t with w-nw-n(t, f) and add them as labels to f if they are not overshadowed by another field and not contained in a segment that is no ancestor of f. The latter prevents assignment of labels from unrelated form segments.

    4. FORM INTERPRETATION

    There is no straightforward relationship between form fields for domain concepts, such as location or price, and their structure within a form. Even seemingly domain-independent concepts, such as price, often exhibit domain-specific peculiarities, such as “guide price”, “current offers in excess”, or payment periods in real estate. OPAL’s domain schemata allow us to cover these specifics. We recall from Section 2 that a form model (F’, t) for a schema S is derived from a form labeling F by extending F with types and restructuring its inner nodes to fit the structural constraints of S.

    OPAL performs form interpretation of a form labeling F in two steps: (1) the classification of nodes in F according to the domain types T to obtain a partial typing t_P. This step relies on the annotation schema L and its typing of labels in F; (2) the model repair where the segmentation structure derived in the segmentation scope (Section 3.2) is aligned with the structure constraints of S.

    4.1 Schema Design: OPAL-TL

    OPAL provides a template language, OPAL-TL, for easily specifying domain schemata reusing common concepts and their constraints as well as concept templates. To implement a new domain, we only need to provide (1) a set of annotators implementing isLabel_a and isValue_a and (2) an OPAL-TL specification of the domain types and their classification and structural constraints.

    OPAL-TL extends Datalog with templates and predefined predicates for convenient querying of annotations and DOM nodes. An OPAL-TL program is executed against a form labeling F and a DOM P. Relations from F and P are mapped in the obvious way to OPAL-TL. We only use child (descendant, resp.) for the child (descen-[…]

    […] either provided by human domain experts or derived from external sources such as DBPedia and Freebase. The current OPAL version contains a large set of such artefacts for common domain types such as price, location, or date.

    Repairing form interpretations. The classification yields a form interpretation F, that is, however, not necessarily a model under S, and may contain […] violations of […] at fields and segments and the segment hierarchy […] the types of F with the rewriting rules described below to construct a form model compliant with S. OPAL performs rewriting in stratified manner to guarantee termination and introduces at most n new […] in the form.

    [Figure 4: Holbrook Moran form and page-scope labeling: (a) web page, (b) page scope]
    [Figure 6: Example Form Labeling]
    [Figure 7: OPAL-TL classification templates]
      TEMPLATE basic_concept<C,A> { concept<C>(N) ⇐ N@A{d,e,p} }
      TEMPLATE concept_by_segment<C,A> { concept<C>(N) ⇐ […] N@A{e,p} }
      TEMPLATE concept_minmax<C,CM,A> {
        concept<CM>(N1) ⇐ child(N1,G), child(N2,G), adjacent(N1,N2),
          N1@A{e,d}, (concept<C>(N2) ∨ N2@A{e,d})
        concept<CM>(N2) ⇐ child(N1,G), child(N2,G), follows(N2,N1),
          concept<C>(N1), N2@range_connector{e,d}, ¬(A1 ≠ A, N2@A1{d})
        concept<CM>(N1) ⇐ child(N1,G), child(N2,G), adjacent(N1,N2),
          N1@A{e,p}, N2@A{e,p}, (N1@min{e,p}, N2@max{e,p})
          ∨ (N1@max{e,p}, N2@min{e,p}) }

    Layout Scope
      Visual heuristics: Find labels in visual proximity of a field
         but: not “overshadowed” by another field
         but: preference for labels preceding a field in reading order
            t visible from a point p if north-west of p
            f overshadows f’ for t if t visible from
               bottom-right corner of f and f, f’ unaligned
               bottom-left corner of f and f, f’ aligned
               bottom-right corner of f, f, f’ aligned and f has no other visual label

    Form Filling: A Passe-partout for the Web
      Master form serves as template for filling individual forms (passe-partout)
         currently: free text values used directly for free text inputs
         but using approximate matching for filling select boxes
      [Screenshot callouts: ➊ URL, ➋ Visualization, Controller, ➌, ➍ Visualization, Ⓐ Field, Ⓑ Segment; filled according to master form]

    Experiments
      [Charts (a) to (f): results for UK Real Estate (100), UK Used Car (100), ICQ (98), and Tel-8 (436);
       per-domain results for Airfare, Auto, Book, Job, US R.E.;
       field, segment, layout, and domain scope results for Real-estate]

    Applications of OPAL
      form filling
         ‘key’ to the web
         all types of web automation need it
      Search
      Data Extraction
      Assistive Devices
      Multi-modal Input (Touch-Screens)
      Web Automation & Testing

    Domain Scope: OPAL-TL
      OPAL-TL extends Datalog¬,≠ with templates and annotation queries
         templates = common pattern in web forms, e.g., a range specification
         rules = constraints for domain-specific element and segment types
         annotation queries: labels annotated with certain type
            direct? are direct only or also group labels?
            proper? look for labels (‘price’) or also values (‘£10’)
            exclusive? […]
         template: limited second-order, same data complexity
      TEMPLATE segment<C> {
        segment<C>(G) ⇐ outlier<C>(G), child(N1,G), ¬ child(N2,G),
          ¬(concept<C>(N2) ∨ segment<C>(N2)) }
      TEMPLATE segment_range<C,CM> {
        segment<C>(G) ⇐ outlier<C>(G), concept<CM>(N1), concept<CM>(N2),
          N1 ≠ N2, child(N1,G), child(N2,G) }
      TEMPLATE segment_with_unique<C,U> {
        segment<C>(G) ⇐ outlier<C>(G), child(N1,G), concept<U>(N1,G),
10
        union of the relations w, nw, and n.                                segments where n is the number  11. of fields form labeling F on a DOM P and an
                                                                                                                 Given
       ¬ child(N2 ,G),N1 6= N2 ,¬(concept<C>(N2 )_segment<C>(N2 )) . }                                                                                   ➎ Master formsibling order                                                             0.99

            In cultures with left-to-right reading direction, we observe a strong schema there isOPAL - TL annotationtqueryextend document and an example, the following template defines a family of con-
                                                                              • template variablesIfquantify over predicates or We is an expres-
                                                                               (1) Under Segmentation: L, resp.) relation intype such
                                                                                         annotation       dant, an a segment n with F. types                               As                                                                                                                 Precision                                        domain
12
                                                                                                                                                           provides passe-                                                                      0.98
                                                                            that Cinstantiation reduceschildto F: follows(X,Yfirst-order variable, Rfollowing (f (X), f (Y )) 2 the domain type D to a node N whenever N
                                                                                                                                                                        straints that associate                                                                 Used-car
   TEMPLATE outlier<C>{                                                       • T (t) requires additional to Datalog where X is ,a . .) for X,Y 2 F, if
                                                                                                          from P segments
                                                                                         sion of the form: X@A{d, p, e} of type t1 . ,tk 62                                                                                                                                                   Recall                                           layout
        preference for placing labels in the w-nw-n region)) } child-T (n), we try to partition the children of n into k + 1 partitions
                                                  0 ,P),¬(segment<C>(G0 from a field. How-
                                                                                                                                                                                                                                                                                                        0.8
                                                                                         A 2 A, and d, P and e are annotation modifiers. An annotation for filling
                                                                                                                                                           partout      is labeled by an exclusive direct and proper annotation of type A.                                                    F-score                                          segment
                                                                                                           p,
        ever, forms often have many fields interspersed with field. .labels and that Piwithandi )no otherholds. .for } F X T (t). between X and Y in document or-
14   outlier<C>(G)( root(G)_child(G,P),child(G                             INSTANTIATE basic_concept<D,A> using node in        {<RADIUS, occurs
                                                                                                                                            radius>}                                                                                            0.97
                                                                                                                                                                                                                                                                            0.6              0.7             0.8            0.9              1 field
                                                                            P1 , . , Pk ,query X@Aµ |= CT (t {d, p, n [ {t1 , . ,tk all C 2 J Aµ K with individual forms
                                                                                         Pn such              µ ✓ and Pe} (X,Y ), |=
        segment labels. Thusstructural constraints
                 Figure 8: OPAL - TL we have to carefully consider overshadowing. a new segment node as
                                                                            For each Pi we add
                                                                                                          der; adjacentchild of n,ifclassify it (f (X), f (Y )) 2 P or vice versa.
                                                                                                                                           Rnext-sibling                 TEMPLATE basic_concept<D,A> { concept<D>(N) ( N@A{e,d,l} }             0.96
                                                                                                                                                                                                       Figure 3: OPAL Interface
                                                                                               @Aµ nodes 2 P : Allowµ (n)  Matchµ labell 0} (X)) µ (A)
        Intuitively, for a field f , a visible text node t is overshadowedJby all K = {nassigned wePiabbreviate (A) 6= (f  Blockas l(X).
                                                                                                          Finally, to from n to that segment. /                                                                                                 0.95                                                     0.6
                                                                            with ti , and move
                                                                                In practice, few cases of multiple under segmentations occur at the                     A template tpl is instantiated to produce a family of rules where                Real estate
                                                                                                                                                                                                                                                         Real-estate       Used car
                                                                                                                                                                                                                                                                           Used-car                                Real-estate Used-car
 first two templates). It is0 the only template 0with two concept tem-                            0
OPAL: automated form understanding for the deep web - WWW 2012

  • 1. DIADEM: domain-centric intelligent automated data extraction methodology. OPAL: Automated Form Understanding for the Deep Web. Xiaonan Guo, DIADEM Group, Department of Computer Science, Oxford University. Joint work with Tim Furche, Giovanni Grasso, Jochen Kranzdorf, Giorgio Orsi, and Christian Schallhart.
  • 2. OPAL ›❯ Scenario 1. A scenario... Looking for a house? Too many websites to check? Tired of searching through dozens of websites?
  • 3–12. The Diversity of Forms
  • 13–17. Automatic Form Understanding. Form Understanding := Form Labeling + Form Interpretation. Form Labeling: Field Scope, Segment Scope, Layout Scope. Form Interpretation: Domain Scope.
  • 18–19. Previous Approaches. Form Labeling across the Layout, Segment, and Field Scopes: Dragut et al., VLDB’09; Khare et al., CIKM’09; Nguyen et al., VLDB’08. Only a subset of the scopes is used for form labeling. Only form labeling, with little or no form interpretation. Where domain-specific, expensive to switch domain.
  • 20–21. OPAL = multi-scope form labeling + flexible form interpretation. (1) Form labeling with all three scopes: more accurate, more robust, less complex. (2) Form interpretation = matching and repair for elements to concepts. (3) Template language and pattern library for defining form ontologies: more succinct, more reuse, less work. [Chart: >95% on UK Real Estate (100), UK Used Car (100), ICQ (98), Tel-8 (436); axis 0.94 to 1.]
  • 22–24. OPAL ›❯ Form Labeling 2. Form Labeling: Layout Scope, Segment Scope, Field Scope. Form Interpretation: Domain Scope. Focus: Field Scope.
  • 25–28. Field Scope. A simple heuristic for individual fields: assigns a text node t to a field f if the least common ancestor of t and f is an ancestor of no other field. Linear coloring algorithm. [Chart: results by scope (field, segment, layout, domain) for Real-estate and Used-car; axis 0.6 to 1.]
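The field-scope rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not OPAL's implementation: the `Node` class, the `kind` values, and the example form are hypothetical, and the sketch makes no attempt to reproduce the linear coloring algorithm the slide mentions. It marks every ancestor of a field with the fields below it, then assigns each text node to a field only when the lowest marked ancestor of the text contains exactly one field.

```python
from dataclasses import dataclass, field as dc_field
from typing import Optional


@dataclass(eq=False)
class Node:
    """Hypothetical DOM node; kind is "element", "field", or "text"."""
    name: str
    kind: str = "element"
    parent: Optional["Node"] = None
    children: list = dc_field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def walk(node):
    """Yield all nodes of the subtree in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def field_scope_labeling(root):
    # Mark each field and all of its ancestors with that field.
    fields_below = {}
    for n in walk(root):
        if n.kind == "field":
            a = n
            while a is not None:
                fields_below.setdefault(id(a), []).append(n)
                a = a.parent
    labels = {}
    for t in walk(root):
        if t.kind != "text":
            continue
        # Lowest ancestor of t that contains at least one field.
        a = t.parent
        while a is not None and id(a) not in fields_below:
            a = a.parent
        # If it contains exactly one field f, it is the least common
        # ancestor of t and f and is an ancestor of no other field.
        if a is not None and len(fields_below[id(a)]) == 1:
            f = fields_below[id(a)][0]
            labels.setdefault(f.name, []).append(t.name)
    return labels


def build_example() -> Node:
    """Hypothetical form: two labelled price fields plus a form-wide text."""
    form = Node("form")
    row1 = form.add(Node("div1"))
    row1.add(Node("Min price", "text"))
    row1.add(Node("price_min", "field"))
    row2 = form.add(Node("div2"))
    row2.add(Node("Max price", "text"))
    row2.add(Node("price_max", "field"))
    form.add(Node("Search properties", "text"))
    return form
```

In the example, "Search properties" stays unassigned at this scope because its lowest field-containing ancestor, the form itself, contains both fields.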
  • 29–32. Segment Scope. Builds a segment tree; groups form elements (e.g. fields, labels) that are likely to be semantically related; assigns labels to fields and groups. Form fields are grouped if (a) they occur in sequence, (b) they have similar style, and (c) their least common ancestor contains no other elements.
  • 33. Segment Tree. [Figure 3: Example DOM and Segment Tree.] Removes layout or template artifacts that artificially break sequences of style-equivalent fields. Algorithm 3: SegmentScopeLabeling(DOM P, Form Labeling F), line 1: S ← SegmentTree(P);
  • 35. Field & Segment Scope: Example
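The grouping conditions of the segment scope (fields in sequence, similar style, nothing else in between) can be illustrated with a small Python sketch. It rests on assumptions that are not in the slides: the `Element` record, the string `style` signature, and the flattening of a common ancestor's content into one document-order list are all hypothetical simplifications, and the third condition is only approximated by letting any non-field element break a run.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Element:
    name: str
    kind: str    # "field" or anything else (hypothetical)
    style: str   # hypothetical style signature, e.g. tag plus CSS class


def group_fields(siblings):
    """Return maximal runs of consecutive, style-equivalent fields."""
    groups = []
    run_key = lambda e: (e.kind == "field", e.style)
    for (is_field, _style), run in groupby(siblings, key=run_key):
        names = [e.name for e in run]
        # A run of one field is left ungrouped.
        if is_field and len(names) >= 2:
            groups.append(names)
    return groups


example = [
    Element("area_north", "field", "checkbox"),
    Element("area_south", "field", "checkbox"),
    Element("area_west", "field", "checkbox"),
    Element("separator", "other", "hr"),
    Element("price_min", "field", "select"),
    Element("price_max", "field", "select"),
    Element("submit", "field", "button"),
]
```

On the example list, the three checkboxes form one group and the two price selects form another, while the lone submit button stays ungrouped.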
  • 36–38. Layout Scope. Aligns visually related texts and fields; prefers texts in the w-nw-n direction; considers overshadowing. [Diagram: field F1 surrounded by the regions NW, N, NE, W, E, SW, S, SE, with texts T1 to T4 and fields F2, F3.]
  • 39–41. Layout Scope Example
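The w-nw-n preference of the layout scope can be sketched with bounding boxes. The following Python is an illustration under assumptions, not OPAL's method: the `Box` type, screen coordinates with y growing downward, and the nearest-gap tie-breaking are all hypothetical, and overshadowing is deliberately left out because the slides do not define it precisely.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box in screen coordinates (y grows downward)."""
    left: float
    top: float
    right: float
    bottom: float


def direction(text: Box, fld: Box) -> str:
    """Region of `text` relative to `fld`: n, ne, e, se, s, sw, w, nw, or ""."""
    vertical = "n" if text.bottom <= fld.top else "s" if text.top >= fld.bottom else ""
    horizontal = "w" if text.right <= fld.left else "e" if text.left >= fld.right else ""
    return vertical + horizontal


def pick_label(fld: Box, texts: dict):
    """Pick the nearest text lying in the w, nw, or n region of the field."""
    best, best_gap = None, float("inf")
    for name, box in texts.items():
        if direction(box, fld) not in {"w", "nw", "n"}:
            continue
        # Gap between the two boxes along each axis (0 if they overlap there).
        dx = max(fld.left - box.right, 0)
        dy = max(fld.top - box.bottom, 0)
        gap = hypot(dx, dy)
        if gap < best_gap:
            best, best_gap = name, gap
    return best


f1 = Box(100, 50, 200, 70)
texts = {
    "Price": Box(40, 50, 90, 70),       # directly to the left of f1
    "Location": Box(100, 10, 160, 30),  # above f1
    "Help": Box(210, 50, 250, 70),      # to the right of f1
}
```

Here "Price" and "Location" are both candidates for `f1`, "Help" is excluded because it lies to the east, and the closer "Price" is chosen.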
  • 42. Form Labeling: Layout Scope, Segment Scope, Field Scope
  • 43. Evaluation: ICQ & Tel8 Benchmark. Domain-independence experiment, ICQ: web query interfaces in 5 domains; 100 query interfaces (20 in each domain). Domain-independence experiment, Tel8: web query interfaces in 8 domains; 477 in total.
  • 44–46. ICQ by Domain. [Chart: domains Airfare, Auto, Book, Job, US R.E.; axis 0.9 to 1; annotations: >98%; Dragut et al., VLDB’09.]
  • 47–48. Tel-8 by Domain. [Chart: domains Airfares, Automobiles, Books, Car Rentals, Hotels, Jobs, Movies, Records; axis 0.9 to 0.99; annotation: >95%.]
  • 49–50. OPAL ›❯ Form Interpretation 3. Form Interpretation. Form Labeling: Layout Scope, Segment Scope, Field Scope. Form Interpretation: Domain Scope.
  • 51–59. Form Interpretation. [Diagram, built up step by step: AreaBranch Elements grouped into an Area-Branch Segment and a Geographic Segment; Price Element (min) and Price Element (max) grouped into a Price Segment; all within a Real-Estate Form.]
  • 60. OPAL ›❯ Form Interpretation: Form Interpretation. Form interpretation in OPAL = annotation + classification + repair. Annotation: straightforward entity recognition. Classification: maps fields to concepts based on annotations. Repair: enforces structural constraints. Together: high-precision disambiguation for form classification.
  • 61. OPAL ›❯ Form Interpretation: OPAL Ontology. What does an OPAL ontology consist of? Annotation: annotation types, e.g. "number", "price period" (p.c.m.). Classification: concepts, e.g. "price field", "location field". Repair: constraints, e.g. "R.E. form must have a location field". Constraints come in two kinds: classification constraints (between annotations and concepts) and structural constraints (between concepts).
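To make the three ontology parts concrete, here is a toy sketch of annotation types, a classification mapping and one structural constraint. Everything in it is an assumption for illustration: the regular expressions, the `CONCEPT_FOR` mapping and the function names are invented, and the last step only detects a violated constraint, whereas OPAL's repair step also rewrites the form model.

```python
import re

# Hypothetical annotators: annotation type -> pattern over label text.
ANNOTATORS = {
    "price": re.compile(r"price|rent", re.I),
    "location": re.compile(r"location|town|postcode", re.I),
}

# Classification constraints: annotation type -> concept.
CONCEPT_FOR = {"price": "price field", "location": "location field"}

# Structural constraint from the slide: a real-estate form needs a location field.
REQUIRED = {"location field"}


def annotate(label):
    """Annotation: return the annotation types whose pattern matches the label."""
    return {t for t, pattern in ANNOTATORS.items() if pattern.search(label)}


def classify(labels_by_field):
    """Classification: give a field a concept if its label has exactly one type."""
    model = {}
    for field, label in labels_by_field.items():
        types = annotate(label)
        if len(types) == 1:
            model[field] = CONCEPT_FOR[next(iter(types))]
    return model


def missing_concepts(model):
    """Constraint check: required concepts that no field was classified as."""
    return REQUIRED - set(model.values())


good = classify({"f1": "Max price", "f2": "Town or postcode", "f3": "Keywords"})
bad = classify({"f1": "Max price"})
```

Here `missing_concepts(good)` is empty, while `missing_concepts(bad)` reports the missing location field.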
  • 62. OPAL ›❯ Form Interpretation: OPAL Ontology. Reducing the effort of designing an OPAL ontology by exploiting ubiquitous patterns of web forms, e.g. a "range" over a numeric type, or dependent types. Halevy et al., VLDB'08 identify 7 such patterns. But: these patterns must be instantiated for a specific domain, e.g. "price range", "radius" linked with "location", "postcode" with "city", … OPAL's solution: templates for these patterns.
  • 63. OPAL ›❯ Form Interpretation: Form Patterns Example. Small set of ubiquitous patterns: ranges, dates, options, etc. Ontology by instantiation.
  • 69. OPAL ›❯ Form Interpretation: OPAL-TL. Patterns are specified using OPAL-TL. OPAL-TL = Datalog + annotation queries + templates.
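The "ontology by instantiation" idea can be mimicked in ordinary code by treating a template as a function that returns a rule. The sketch below is only an analogy in Python, not OPAL-TL: `basic_concept` borrows its name from the template shown later in the deck, the `<RADIUS, radius>` binding appears on the poster, and the `PRICE` binding and all helper names are assumptions.

```python
def basic_concept(concept, annotation_type):
    """Template: build a rule that assigns `concept` to fields
    whose annotations include `annotation_type`."""
    def rule(annotation_types):
        return concept if annotation_type in annotation_types else None
    return rule


def instantiate(template, bindings):
    """Instantiate one template once per (concept, annotation type) binding."""
    return [template(*binding) for binding in bindings]


# One pattern, several domain-specific instances.
rules = instantiate(basic_concept, [("RADIUS", "radius"), ("PRICE", "price")])


def classify(annotation_types):
    """Apply every instantiated rule and collect the concepts that fire."""
    return [c for rule in rules if (c := rule(annotation_types)) is not None]
```

Adding a new domain concept then means adding one binding rather than writing a new rule by hand, which is the saving the slides attribute to templates.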
  • 70. OPAL ›❯ Form Interpretation: Annotation Queries. Annotation types come in two varieties: "label" types and "value" types, e.g. "price" vs. "€120". Disambiguation works through a fixed precedence on annotation types: e.g., a field with the value "lowest price first" carries both a "price" and an "order-by" annotation, but "order-by" < "price".
  • 77. OPAL ›❯ Form Interpretation: Annotation Queries. N@A{d, p, e} selects "all nodes N having labels annotated with A". Modifiers: d = direct (labels associated with the node itself); p = proper (from labels, but not from field values); e = exclusive (labels considering precedence). Example (Figure 6: Example Form Labeling; nodes 1 to 4 with annotations of types A, B and C; callouts: segment, field, value label, proper label), assuming B < A: N@A{} = 2, 3, 4; N@A{d, e} = 2, 4; N@A{p, e} = 2, 3. [Background excerpt from the paper: "... are either provided by human domain experts or derived from external sources such as DBPedia and Freebase. The current OPAL version contains a large set of such artefacts for common domain types such as price, location, or date."]
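A small sketch of how a query such as N@A{d, p, e} might be evaluated. The data model and semantics are assumptions: each annotation records whether it is direct and proper, a pair (B, A) in `wins_over` stands for "B < A" and is read as "B takes precedence over A", and exclusivity is checked only among annotations that survive the d and p filters. The toy data is invented and does not transcribe Figure 6.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    type: str      # annotation type, e.g. "price"
    direct: bool   # attached to the node itself, not inherited from a segment
    proper: bool   # stems from a label text rather than a field value


def query(nodes, a, modifiers="", wins_over=frozenset()):
    """Evaluate N@a{modifiers}; nodes maps node id -> list of Annotation."""
    hits = []
    for node, annotations in nodes.items():
        candidates = [x for x in annotations
                      if (x.direct or "d" not in modifiers)
                      and (x.proper or "p" not in modifiers)]
        types = {x.type for x in candidates}
        if a not in types:
            continue
        # Exclusive: drop the node if a stronger type is also present.
        if "e" in modifiers and any((b, a) in wins_over for b in types):
            continue
        hits.append(node)
    return sorted(hits)


nodes = {
    "f1": [Annotation("price", True, True)],     # own label "Price"
    "f2": [Annotation("price", False, True)],    # label inherited from segment
    "f3": [Annotation("price", True, False)],    # value such as "£100"
    "f4": [Annotation("price", True, False),     # value "lowest price first"
           Annotation("order-by", True, False)],
}
order = frozenset({("order-by", "price")})

all_price = query(nodes, "price")
direct_exclusive = query(nodes, "price", "de", order)
proper_exclusive = query(nodes, "price", "pe", order)
```

With these assumptions the sort-order field `f4` is matched by the unrestricted query but dropped as soon as the exclusive modifier is used.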
  • 78. OPAL ›❯ Form Interpretation: OPAL-TL Example. Price range: two successive fields in the same group; at least one "price" type; range connector in between.
      TEMPLATE basic_concept<C,A> { concept<C>(N) ⇐ N@A{d,e,p} }
      TEMPLATE concept_by_segment<C,A> { concept<C>(N) ⇐ N@A{e,p} }
      TEMPLATE concept_minmax<C,CM,A> {
        concept<CM>(N1) ⇐ child(N1,G), child(N2,G), adjacent(N1,N2), N1@A{e,d}, (concept<C>(N2) ∨ N2@A{e,d})
        concept<CM>(N2) ⇐ child(N1,G), child(N2,G), follows(N2,N1), concept<C>(N1), N2@range_connector{e,d}, ¬(A1 A, N2@A1{d})
        concept<CM>(N1) ⇐ child(N1,G), child(N2,G), adjacent(N1,N2), N1@A{e,p}, N2@A{e,p}, (N1@min{e,p}, N2@max{e,p}) ∨ (N1@max{e,p}, N2@min{e,p})
      }
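The third rule of `concept_minmax` (two adjacent fields in the same group, both annotated with A, one marked "min" and the other "max") can be paraphrased in Python as below. This is a loose paraphrase under assumptions: a group is a list of annotation-type sets in document order, the `{e,p}` modifiers are ignored, and adjacency is treated as symmetric so both fields of the pair are reported.

```python
def minmax_fields(group, a):
    """Return the indices of fields in `group` that form a min/max pair for
    annotation type `a`. `group` holds one set of annotation types per field."""
    hits = set()
    for i in range(len(group) - 1):
        n1, n2 = group[i], group[i + 1]          # adjacent siblings
        both_a = a in n1 and a in n2
        min_max = (("min" in n1 and "max" in n2)
                   or ("max" in n1 and "min" in n2))
        if both_a and min_max:
            hits.update({i, i + 1})
    return hits


price_range = minmax_fields(
    [{"price", "min"}, {"price", "max"}, {"location"}], "price")
not_adjacent = minmax_fields(
    [{"price", "min"}, {"location"}, {"price", "max"}], "price")
```

The second call returns nothing because the two price fields are separated by another field, which mirrors the adjacency requirement of the rule.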
  • 80. OPAL ›❯ Form Interpretation: OPAL Repair. Enforces structural constraints by repairing 4 types of violations. Under segmentation: missing segments. Over segmentation: superfluous segments. Under classification: missing types. Over classification: superfluous types.
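As a rough illustration of the four violation types, the helper below compares what was found in a form with what a schema requires and allows. It is a hypothetical detector only: the set-based schema representation and the function name are assumptions, and OPAL's actual repair step goes further by rewriting the form model.

```python
def violations(found, required, allowed, kind):
    """List ("under <kind>", x) for required items that are missing and
    ("over <kind>", x) for found items the schema does not allow.
    `kind` is "segmentation" or "classification"."""
    result = [(f"under {kind}", name) for name in sorted(required - found)]
    result += [(f"over {kind}", name) for name in sorted(found - allowed)]
    return result


# Field types found in a form versus a hypothetical real-estate schema.
type_issues = violations(
    found={"price"},
    required={"price", "location"},
    allowed={"price", "location", "bedrooms"},
    kind="classification",
)

# Segment types found versus the same hypothetical schema.
segment_issues = violations(
    found={"geographic", "misc"},
    required={"geographic", "price"},
    allowed={"geographic", "price"},
    kind="segmentation",
)
```

Each reported pair names the violation type from the slide together with the segment or type that triggered it.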
  • 81. OPAL ›❯ Evaluation: OPAL Evaluation. [Diagram: form labeling (field scope, segment scope, layout scope) and form interpretation (domain scope).]
  • 82. OPAL ›❯ Evaluation: Datasets. Domain-awareness experiment: UK real estate and used car domains, 100 web forms sampled for each domain. Domain-independence experiment (ICQ): web query interfaces in 5 domains, 100 query interfaces (20 in each domain). Domain-independence experiment (Tel-8): web query interfaces in 8 domains, 477 in total.
  • 83. OPAL ›❯ Evaluation: Evaluation Summary. [Bar chart: precision, recall and F-score, y-axis from 0.94 to 1, for UK Real Estate (100), UK Used Car (100), ICQ (98) and Tel-8 (436).]
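For reference, the three measures in this chart follow the usual definitions. The sketch assumes that "F-score" means the balanced F1 measure, which the slides do not state explicitly, and the counts in the example are made up rather than taken from the evaluation.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true positives, false positives
    and false negatives (each measure is 0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 98 correct field labels, 2 wrong ones, 1 missed.
p, r, f = precision_recall_f1(tp=98, fp=2, fn=1)
```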
  • 84. OPAL ›❯ Evaluation: Scope Contribution. [Bar chart: contribution of field, segment, layout and domain scope for the real-estate and used-car domains; axis from 0.6 to 1.]
  • 86. OPAL ›❯ Evaluation: OPAL Performance (incl. browser). [Plot: total computation time in seconds (0 to 200) against number of nodes (0 to 8000); annotated ">70% below 50s".]
  • 87. OPAL ›❯ Conclusion: OPAL conclusion. Multi-scope domain-independent form labeling: combines visual, textual, and structural features; expands form labeling with field, segment, page scope. Domain-aware form interpretation: classifies form fields and repairs the domain form model; disambiguates form classification (up to 10% gain in precision). Template language OPAL-TL: provides a library of ubiquitous form patterns; significantly speeds up domain instantiation.
  • 88. OPAL ›❯ Conclusion: Future work. Integrate form understanding and record extraction (e.g., AMBER): "site scope". Semi-automatic ontology creation for new domains: tailor ontology learning approaches to forms. Integrate form constraint extraction from probing and JavaScript analysis (ProFound demo).
  • 89. DIADEM ›❯ OPAL: poster "OPAL: A Passe-partout for Web Forms". DIADEM: domain-centric intelligent automated data extraction methodology. Authors: Xiaonan Guo, Jochen Kranzdorf, Tim Furche, Giovanni Grasso, Giorgio Orsi, Christian Schallhart. Digital home: diadem.cs.ox.ac.uk/opal/. Contact: diadem-opal@cs.ox.ac.uk.

    OPAL: Automated Form Understanding for the Deep Web
    - OPAL combines multi-scope domain-independent analysis with domain knowledge.
    - Multi-scope domain-independent analysis: field, segment, and layout scope.
    - Integration of three scopes yields more robust, simpler heuristics than single scope.
    - Strict preference for disambiguation for quality and performance reasons.
    - Domain knowledge on top of the domain-independent analysis for classifying form fields and segments according to the domain ontology, and for verifying and repairing the form model to be consistent with domain constraints.
    - Datalog-based template language for easy definition of domain knowledge.
    - OPAL outperforms previous approaches even without domain knowledge.
    - With domain knowledge we achieve nearly perfect accuracy in multiple domains.
    [Pipeline diagram: Field Scope (fields & labels), Segment Scope (segments & labels), Layout Scope (visual labels), Domain Scope (form model); structures shown: DOM tree, segment tree, layout tree, schema tree.]

    OPAL: A Passe-partout for the Web
    - Automatic form filling based on a domain-specific master form.
    - Automatic (approximate) matching of master form values to values of concrete fields.
    - Visualization of form and segment concepts.
    - Automatically detects forms of the given domain and fills them.
    - Works on nearly any page of the domain.

    Segment Scope
    - Segmentation: find the "logical" structure of the form. Segmentation tree with only fields as leaves and form segments s as inner nodes such that s has at least degree two and all fields in s are style-equivalent.
    - Segmentation labeling: distribute labels to fields, if there is a regular structure; to segments, if single, prefix label.
    - Style-equivalence: two nodes n and n' are style-equivalent if same class or same type and CSS style.
    - Complexity: linear in document size and depth.
    [Figure 4: Example for Segment Scope Labeling.]

    Layout Scope
    - Visual heuristics: find labels in visual proximity of a field; but: not "overshadowed" by another field; but: preference for labels preceding a field in reading order.
    - t visible from a point p if north-west of p.
    - f overshadows f' for t if t visible from: the bottom-right corner of f and f, f' unaligned; the bottom-left corner of f and f, f' aligned; the bottom-right corner of f, f, f' aligned and f has no other visual label.
    [Figure 5: Layout Scope Labeling; compass regions NW, N, NE, W, E, SW, S, SE around F1, with text nodes T1 to T4 and fields F2, F3.]

    Domain Scope: OPAL-TL
    - OPAL-TL extends Datalog¬,≠ with templates and annotation queries.
    - Templates = common patterns in web forms, e.g., range specification.
    - Rules = constraints for domain-specific element and segment types.
    - Annotation queries: labels annotated with a certain type. direct? direct only or also group labels. proper? look for labels ('price') or also values ('£10'). exclusive? consider precedence between annotation types.
    - Template: limited second-order, same data complexity. Template variables quantify over predicates; instantiation reduces to Datalog.
    - Classification templates basic_concept, concept_by_segment and concept_minmax, as on the OPAL-TL example slide.
    - OPAL-TL structural constraints:
      TEMPLATE segment<C> { segment<C>(G) ⇐ outlier<C>(G), child(N1,G), ¬ child(N2,G), ¬(concept<C>(N2) ∨ segment<C>(N2)) }
      TEMPLATE segment_range<C,CM> { segment<C>(G) ⇐ outlier<C>(G), concept<CM>(N1), concept<CM>(N2), N1 ≠ N2, child(N1,G), child(N2,G) }
      TEMPLATE segment_with_unique<C,U> { segment<C>(G) ⇐ outlier<C>(G), child(N1,G), concept<U>(N1,G), ¬ child(N2,G), N1 ≠ N2, ¬(concept<C>(N2) ∨ segment<C>(N2)) }
      TEMPLATE outlier<C> { outlier<C>(G) ⇐ root(G) ∨ child(G,P), child(G',P), ¬(segment<C>(G')) }

    Form Filling: A Passe-partout for the Web
    - Master form serves as template for filling individual forms (passe-partout).
    - Currently: free text values used directly for free text inputs, but using approximate matching for filling select boxes.
    [Figure 3: OPAL Interface; callouts include URL, visualization controller, field and segment concepts, and the master form that provides the passe-partout for filling individual forms.]

    Experiments
    [Charts: precision, recall and F-score for UK Real Estate (100), UK Used Car (100), ICQ (98) and Tel-8 (436); ICQ by domain (Airfare, Auto, Book, Job, US R.E.); scope contribution (field, segment, layout, domain) for real-estate and used-car.]

    Applications of OPAL
    - Form filling is the "key" to the web; all types of web automation need it.
    - Search, data extraction, assistive devices, multi-modal input (touch-screens), web automation and testing.

    Excerpts from the paper reproduced on the poster (beginnings and ends are partly cut off):
    - Segment scope: "... ns as representative for s (f(s) = ns). For each segment with regular interleaving of text nodes and field or segment nodes, we use those text nodes as labels for these nodes, preserving any already assigned labels and fields (from field scope). In detail, we iterate over all descendants c of each segment in document order, skipping any nodes that are descendants of another segment or field that is itself contained in n (line 13). In the iteration, we collect all field or segment nodes in Nodes, and all sets of text nodes between field or segment nodes in Labels, except those text nodes already assigned as labels in field scope (line 14), as we assume that these are outliers in the regular structure of the segment. We assign the i-th text node group to the i-th field, if the two lists have the same size (possibly using the first text node as labels of the segment, line 17–19). Figure 4 illustrates the segment scope labeling with triangles denoting text nodes, diamonds fields, black circles segments, and white circles DOM nodes not in the segment tree. The numbers indicate which text nodes are assigned as labels to which segments or fields. E.g., for the left hand segment, we observe a regular structure of (text node+, field)+ and thus we assign the i-th group of text nodes to the i-th field. For the right hand segment (4), we find a subsegment (5) and field 8 that is already labeled with text node 8 in the field scope. Thus 8 is ignored and only one text node remains directly in 4, which becomes the segment label. In 5, we find one more text node group than fields and thus consider the first text node group as a segment label. The remaining nodes have a regular structure (field, text node+)+ and get assigned accordingly."
    - 3.3 Layout Scope: "At layout scope, we further refine the form labeling for each form field not yet labelled in field or segment scope, by exploring the visible text nodes in the west, north-west, or north quadrant, if they are not overshadowed by any other field. To this end, OPAL constructs a layout tree from the CSS box labels of the DOM nodes: DEFINITION 9. The layout tree of a given DOM P is a tuple (N_P, ⊑, w, nw, n, ne, e, se, s, sw, aligned) where N_P is the set of DOM nodes from P, ⊑, w, nw, n, ... the "belongs to" (containment), west, north-west, north, ... relations from RCR [12], and aligned(x, y) holds if x and y have the same height and are horizontally aligned. We call w, nw, ... the neighbour relations. The layout tree is at most quadratic in size of a given DOM P and can be computed in O(|P|^2). For convenience, we write, e.g., w-nw-n to denote the union of the relations w, nw, and n. In cultures with left-to-right reading direction, we observe a strong preference for placing labels in the w-nw-n region from a field. However, forms often have many fields interspersed with field labels and segment labels. Thus we have to carefully consider overshadowing. Intuitively, for a field f, a visible text node t is overshadowed by ... w-nw-n-ne-e(t, f') or (2) f and f' are aligned and (i) w(t, f') or (ii) nw-n(t, f') and there is a text node t' not overshadowed by another field with ne-e(t', f') and w-nw-n(t', f). To illustrate this overshadowing, consider the example in Figure 5. For field F1, T2 and T4 are overshadowed by F2 and T3 by F3 ... The layout scope labeling is then produced as follows: For each field f, we collect all text nodes t with w-nw-n(t, f) and add them as labels to f if they are not overshadowed by another field and not contained in a segment that is no ancestor of f. The latter prevents assignment of labels from unrelated form segments."
    - 4. Form Interpretation: "There is no straightforward relationship between form fields for domain concepts, such as location or price, and their structure within a form. Even seemingly domain-independent concepts, such as price, often exhibit domain specific peculiarities, such as "guide price", "current offers in excess", or payment periods in real estate. OPAL's domain schemata allow us to cover these specifics. We recall from Section 2 that a form model (F', t) for a schema S is derived from a form labeling F by extending F with types and restructuring its inner nodes to fit the structural constraints of S. OPAL performs form interpretation of a form labeling F in two steps: (1) the classification of nodes in F according to the domain types T to obtain a partial typing t_P. This step relies on the annotation schema L and its typing of labels in F; (2) the model repair where the segmentation structure derived in the segmentation scope (Section 3.2) is aligned with the structure constraints of S."
    - 4.1 Schema Design: OPAL-TL: "OPAL provides a template language, OPAL-TL, for easily specifying domain schemata reusing common concepts and their constraints as well as concept templates. To implement a new domain, we only need to provide (1) a set of annotators implementing isLabel_a and isValue_a and (2) an OPAL-TL specification of the domain types and their classification and structural constraints. OPAL-TL extends Datalog with templates and predefined predicates for convenient querying of annotations and DOM nodes. An OPAL-TL program is executed against a form labeling F and a DOM P. Relations from F and P are mapped in the obvious way to OPAL-TL. ..."
    - Annotation queries: "DEFINITION 11. Given a form labeling F on a DOM P and an annotation schema L, an OPAL-TL annotation query is an expression of the form X@A{d, p, e}, where X is a first-order variable, A ∈ 𝒜, and d, p and e are annotation modifiers. ..."
    - Repair: "The classification yields a form interpretation F, that is, however, not necessarily a model under S, and may contain violations of structural constraints. ... OPAL performs the rewriting in a stratified manner to guarantee termination and introduces at most n new segments where n is the number of fields in the form. (1) Under Segmentation: If there is a segment n with type t such that C_T(t) requires additional child segments of type t1, ..., tk ∉ child-T(n), we try to partition the children of n into k + 1 partitions P1, ..., Pk, Pn ... For each Pi we add a new segment node as child of n, classify it with ti, and move all nodes assigned to Pi from n to that segment. ..."

    Thank You

Editor's notes

  21. OPAL = ontology-based web form pattern analysis with logic