Entity Matching for Semistructured Data
                                in the Cloud


                               Marcus Paradies
                            ACM SAC 2012 - CC Track
                                  March 27, 2012




Marcus Paradies               Entity Matching for Semistructured Data in the Cloud
                                                                                     1 / 19
Outline


        1   Motivation

        2   ChuQL

        3   Entity Matching

        4   MAXIM: Entity Matching in the Cloud

        5   Summary




Motivation


                  Enriching/Improving Wikipedia

        References from Wikipedia article Hash join




        Lookup in the CiteSeer database




        Lookup in Google






                      Wikipedia in a nutshell

        Characteristics
                  3.7 million articles (English Wikipedia database)
                  Dataset size: about 30 GB of XML (without history)
                  3.6 million references
                  References are categorized into books, journals, websites, etc.




        Challenges
                  Articles in Wikipedia are incomplete
                  Articles in Wikipedia are inaccurate
                  Articles in Wikipedia are subjective



                      Problem Statement

        Definition
        Given two datasets of records, $R$ and $S$, a set of attributes
        $a_1, \dots, a_n$, a set of similarity functions $sim_{a_1}, \dots, sim_{a_n}$, and a
        similarity threshold $\tau$, the entity matching task between $R$ and $S$ is defined as
        finding and combining all pairs of records from $R$ and $S$ where
            $\sum_{i=1}^{n} sim_{a_i}(R.a_i, S.a_i) \geq \tau$

        Wikipedia Data set

                  {{Cite book
                    | last = Mumford
                    | first = David
                    | authorlink = David Mumford
                    | title = The Red Book of Varieties and Schemes
                    | publisher = [[Springer]]
                    | location = Berlin
                    | date = 1999
                    | page = 198
                    | doi = 10.1007/b62130
                    | isbn = 354063293X
                  }}

        CiteSeer Data set

                  <record id="6627383">
                    <author>David Mumford</author>
                    <title>The red book of Varieties and Schemes</title>
                    <publisher>Springer</publisher>
                    <year>1999</year>
                    <doi>10.1007/b62130</doi>
                  </record>
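
The problem statement above can be sketched in a few lines of Python. This is only an illustration: it assumes the per-attribute similarities are aggregated by summation, and the similarity function (a `difflib` ratio on lower-cased strings), the attribute selection, and the threshold value are assumptions rather than choices taken from the talk.

```python
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(r: dict, s: dict, sims: dict, tau: float) -> bool:
    """Return True if the summed attribute similarities reach the threshold tau."""
    total = sum(sim(r[attr], s[attr]) for attr, sim in sims.items())
    return total >= tau


# The two records from the slide, reduced to attributes both sides share.
# The Wikipedia author is assembled from the "first" and "last" fields.
wikipedia_ref = {
    "author": "David Mumford",
    "title": "The Red Book of Varieties and Schemes",
    "year": "1999",
}
citeseer_rec = {
    "author": "David Mumford",
    "title": "The red book of Varieties and Schemes",
    "year": "1999",
}

# One similarity function per attribute; the threshold 2.5 is hypothetical.
sims = {"author": string_sim, "title": string_sim, "year": string_sim}
matched = is_match(wikipedia_ref, citeseer_rec, sims, tau=2.5)
```

With three attributes the sum ranges from 0 to 3, so a threshold of 2.5 demands near-identical values on every attribute; a real deployment would tune both the functions and the threshold per domain.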

ChuQL






                     ChuQL by example



        Wordcount in ChuQL
     mapreduce {
         input { fn:collection("hdfs://wiki/") }
         rr { for $rev in $hxml:in//revision
              return {"key": fn:data($rev//username | $rev//ip),
                      "val": $rev//title} }
         map { $hxml:in }
         reduce { {"key": $hxml:in=>"key", "val": fn:count($hxml:in=>"val")} }
         rw { <author name="{$hxml:in=>"key"}" count="{$hxml:in=>"val"}"/> }
         output { fn:put("hdfs://outputdir/") }
     }
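
To make the data flow of the job above concrete, the following Python sketch mimics its phases (record reader, map, reduce, record writer) on an in-memory list. It is not ChuQL and does not touch Hadoop; the sample revisions and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for the <revision> elements read from HDFS.
revisions = [
    {"contributor": "alice", "title": "Hash join"},
    {"contributor": "bob", "title": "Hash join"},
    {"contributor": "alice", "title": "Sort-merge join"},
]


def record_reader(revs):
    """rr phase: emit one (key, val) pair per revision."""
    for rev in revs:
        yield rev["contributor"], rev["title"]


def map_phase(pairs):
    """map phase: identity, as in the job above."""
    yield from pairs


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, val in pairs:
        groups[key].append(val)
    return groups


def reduce_phase(groups):
    """reduce phase: count the values collected for each key."""
    return {key: len(vals) for key, vals in groups.items()}


def record_writer(counts):
    """rw phase: render one <author/> element per key."""
    return [f'<author name="{name}" count="{n}"/>' for name, n in sorted(counts.items())]


output = record_writer(reduce_phase(shuffle(map_phase(record_reader(revisions)))))
```

Each function corresponds to one clause of the job, which shows that the example counts revisions per contributor rather than words.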




Entity Matching






                  What is Entity Matching?








        Challenges
                  Entity Matching has quadratic runtime behavior
                  Entity Matching has high CPU- and memory demands
                  The definition of “what is similar” is domain-dependent



                  Entity Matching Architecture

        [Figure: data sources S1 and S2 pass through a Blocking step that produces blocks b1, b2, b3, ..., bn; a Matching step over the blocks yields the match result R.]

                  How can we improve the runtime of an EM task?
                  Distributed Blocking
                  Parallel Matching

MAXIM: Entity Matching in the Cloud


                      Requirements and Approach

        Requirements
                  Efficient processing of semistructured data
                  Scalability to large datasets
                  Independence from specific similarity functions
                  Ability to easily add new similarity functions





        Main Idea
                  Use MapReduce and ChuQL to process semistructured data
                  Use search-based blocking to generate candidate pairs
                  Apply similarity functions to candidate pairs within a block



                      Architecture
        [Figure: Nodes 1 to N. Each node hosts a search engine with a full-text index and a Hadoop instance comprising a data node, a task tracker, and a ChuQL engine. All nodes share HDFS.]

        Architecture
                  Hadoop cluster with up to 40 nodes
                  Each node runs a search engine and an attached full-text index
                  Each node runs an in-memory XQuery processor
                  Semistructured data is partitioned and placed on HDFS




                      Processing Stages




        Three Stages
                  Preparation Stage
                  Blocking Stage
                  Matching Stage


        [Figure: Preparation Stage, with Search Engines and HDFS as storage. Extract Wikipedia references: extract references, store references. Index CiteSeerX records: transform into full-text index XML, build index.]

        Stage 1: Preparation Stage
                  Extracts references from Wikipedia
                  Reads and transforms records from CiteSeerX
                  Sends CiteSeerX data to local full-text index


        [Figure: Blocking Stage, following the Preparation Stage. Generate semantic block: retrieve references, generate query, get query response, store blocks.]

        Stage 2: Blocking Stage
                  Reads extracted references from HDFS
                  Probes full-text index to retrieve candidate publications
                  Assigns candidate publications to block(s)
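
The idea of search-based blocking can be illustrated with a toy inverted index. The sketch below is an assumption-laden stand-in: the talk uses a full-text search engine on each node, whereas here a plain token-to-ids dictionary plays that role, and the publication ids, titles, ranking rule, and `top_k` cutoff are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list:
    """Lower-case a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(publications: dict) -> dict:
    """Map each title token to the set of publication ids containing it."""
    index = defaultdict(set)
    for pub_id, title in publications.items():
        for token in tokenize(title):
            index[token].add(pub_id)
    return index


def probe(index: dict, reference_title: str, top_k: int = 2) -> list:
    """Return up to top_k publication ids sharing the most tokens with the reference."""
    hits = Counter()
    for token in set(tokenize(reference_title)):
        for pub_id in index.get(token, ()):
            hits[pub_id] += 1
    # Rank by shared-token count, breaking ties by id for determinism.
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [pub_id for pub_id, _ in ranked[:top_k]]


# Hypothetical publication titles keyed by made-up ids.
publications = {
    "p1": "An Adaptive Hash Join Algorithm for Multi-User Environments",
    "p2": "The Red Book of Varieties and Schemes",
    "p3": "Hash Join Algorithms in a Multi-User Setting",
}
index = build_index(publications)

# One block per reference: the reference plus its candidate publications.
reference = "An adaptive hash join algorithm for multi-user environments"
block = {"reference": reference, "candidates": probe(index, reference)}
```

Only the publications returned by the probe enter the block, so the later matching step compares each reference against a handful of candidates instead of the whole collection.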


        [Figure: Matching Stage, following the Preparation and Blocking Stages. Record pair generation: verify candidate pairs, store record pairs.]

        Stage 3: Matching Stage
                  Read blocks from HDFS
                  Generate candidate pairs and apply similarity functions
                  Store matching pairs and their similarity
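
A minimal sketch of the matching stage, written so that new similarity functions can be plugged in by name. The registry, the two similarity functions, the block layout, and the threshold are illustrative assumptions and not the MAXIM implementation.

```python
from difflib import SequenceMatcher

# Registry of similarity functions, keyed by a user-chosen name.
SIMILARITY_FUNCTIONS = {}


def similarity(name):
    """Decorator that registers a similarity function under a name."""
    def register(func):
        SIMILARITY_FUNCTIONS[name] = func
        return func
    return register


@similarity("edit")
def edit_like(a: str, b: str) -> float:
    """Character-level similarity based on difflib's matching-block ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


@similarity("jaccard")
def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of the whitespace-separated, lower-cased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def match_block(block: dict, sim_name: str, tau: float) -> list:
    """Pair the block's reference with each candidate; keep pairs scoring >= tau."""
    sim = SIMILARITY_FUNCTIONS[sim_name]
    results = []
    for cand_id, cand_title in block["candidates"].items():
        score = sim(block["reference"], cand_title)
        if score >= tau:
            results.append((cand_id, score))
    return results


# Hypothetical block: one reference title and its candidate publication titles.
block = {
    "reference": "An adaptive hash join algorithm for multi-user environments",
    "candidates": {
        "p1": "An Adaptive Hash Join Algorithm for Multi-User Environments",
        "p3": "Hash Join Algorithms in a Multi-User Setting",
    },
}
matches = match_block(block, "jaccard", tau=0.8)
```

Adding another similarity measure only requires a new decorated function; `match_block` stays unchanged, which mirrors the requirement of independence from specific similarity functions.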




                  Stage 1: Preparation Stage

     Extracting References                       Indexing Publications







                                                   Extraction


          {{cite journal
            | author1 = Hansjörg Zeller
            | author2 = Jim Gray
            | title = An Adaptive Hash Join Algorithm for Multi-User Environments
            | journal = Proceedings of the 16th VLDB conference
            | year = 1990
            | pages = 186–197
          }}







                                                   Transformation


          <reference type="journal">
           <author1>Hansjörg Zeller</author1>
           <author2>Jim Gray</author2>
           <title>An Adaptive Hash Join Algorithm for Multi-User Environments</title>
           <journal>Proceedings of the 16th VLDB conference</journal>
           <year>1990</year>
           <pages>186–197</pages>
          </reference>
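
The extraction and transformation steps shown above can be approximated in Python. This is a simplified sketch under stated assumptions: it handles only flat `| key = value` parameters, would mis-split values containing a pipe (for example piped wiki links), and the function names are hypothetical rather than part of the system described in the talk.

```python
import xml.etree.ElementTree as ET


def parse_cite_template(wikitext: str):
    """Split a {{cite ...}} template into its type and its key/value fields."""
    inner = wikitext.strip()
    if not (inner.startswith("{{") and inner.endswith("}}")):
        raise ValueError("not a template")
    head, *params = inner[2:-2].split("|")
    ref_type = head.strip().split(None, 1)[1].lower()  # "cite journal" -> "journal"
    fields = {}
    for param in params:
        key, sep, value = param.partition("=")
        if sep:  # skip positional parameters without "="
            fields[key.strip()] = value.strip()
    return ref_type, fields


def to_reference_xml(wikitext: str) -> str:
    """Render the template as a <reference> element with one child per field."""
    ref_type, fields = parse_cite_template(wikitext)
    root = ET.Element("reference", {"type": ref_type})
    for key, value in fields.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


template = """{{cite journal
 | author1 = Hansjörg Zeller
 | author2 = Jim Gray
 | title = An Adaptive Hash Join Algorithm for Multi-User Environments
 | journal = Proceedings of the 16th VLDB conference
 | year = 1990
 | pages = 186–197
}}"""
reference_xml = to_reference_xml(template)
```

The output is a single-line serialization of the same element structure as the transformed reference on the slide.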


MAXIM: Entity Matching in the Cloud


                                  Stage 1: Preparation Stage

     Extracting References                                                            Indexing Publications


                                                                                                                 HDFS
                                                   Extraction

                                                                                                                        Read and Transformation
          {{cite journal
            | author1 = Hansjörg Zeller
            | author2 = Jim Gray
                                                                                          <doc>
            | title = An Adaptive Hash Join Algorithm for Multi-User Environments             <field name="id">10.1.1.49.2550</field>
            | journal = Proceedings of the 16th VLDB conference                               <field name="title">Selecting Tense, Aspect, and
            | year = 1990                                                                             Connecting Words In Language
                                                                                          Generation</field>
            | pages = 186–197                                                                 <field name="author">Bonnie Dorr</field>
          }}                                                                                  <field name="description">Generating language
                                                                                          ...</field>
                                                                                            </doc>

                                                   Transformation


          <reference type=“journal“>
           <author1>Hansjörg Zeller</author1>
           <author2>Jim Gray</author2>
           <title>An Adaptive Hash Join Algorithm for Multi-User
          Environments</title>
           <journal>Proceedings of the 16th VLDB conference</journal>
           <year>1990</year>
           <pages>186–197</pages>
          </reference>


Marcus Paradies                                                                     Entity Matching for Semistructured Data in the Cloud
                                                                                                                                                  15 / 19
MAXIM: Entity Matching in the Cloud


                                  Stage 1: Preparation Stage

     Extracting References

          Extraction

          {{cite journal
            | author1 = Hansjörg Zeller
            | author2 = Jim Gray
            | title = An Adaptive Hash Join Algorithm for Multi-User Environments
            | journal = Proceedings of the 16th VLDB conference
            | year = 1990
            | pages = 186–197
          }}

          Transformation

          <reference type="journal">
           <author1>Hansjörg Zeller</author1>
           <author2>Jim Gray</author2>
           <title>An Adaptive Hash Join Algorithm for Multi-User Environments</title>
           <journal>Proceedings of the 16th VLDB conference</journal>
           <year>1990</year>
           <pages>186–197</pages>
          </reference>

     Indexing Publications

          HDFS

          Read and Transformation

          <doc>
            <field name="id">10.1.1.49.2550</field>
            <field name="title">Selecting Tense, Aspect, and Connecting Words In Language Generation</field>
            <field name="author">Bonnie Dorr</field>
            <field name="description">Generating language ...</field>
          </doc>

          Indexing

          Lucene Index

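The extraction and transformation steps of the preparation stage can be illustrated with a minimal Python sketch. It is not the MAXIM implementation (which runs on ChuQL and Hadoop); the function names `parse_cite_template` and `to_reference_xml` are hypothetical, and the parser assumes a single flat template without nested templates or `|` characters inside values.

```python
import re
import xml.etree.ElementTree as ET


def parse_cite_template(wikitext):
    """Parse one flat {{cite <type> | key = value ...}} template.

    Returns (reference_type, fields). Assumption: no nested templates and
    no '|' inside values (e.g. piped wikilinks are not handled).
    """
    match = re.search(r"\{\{\s*cite\s+(\w+)(.*?)\}\}", wikitext,
                      flags=re.DOTALL | re.IGNORECASE)
    if match is None:
        raise ValueError("no cite template found")
    ref_type = match.group(1).lower()
    fields = {}
    for part in match.group(2).split("|"):
        if "=" not in part:
            continue  # skips the empty chunk before the first '|'
        key, value = part.split("=", 1)
        fields[key.strip()] = value.strip()
    return ref_type, fields


def to_reference_xml(ref_type, fields):
    """Serialize the parsed fields as a <reference type="..."> element."""
    root = ET.Element("reference", {"type": ref_type})
    for key, value in fields.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    template = """{{cite journal
      | author1 = Hansjörg Zeller
      | author2 = Jim Gray
      | year = 1990
    }}"""
    print(to_reference_xml(*parse_cite_template(template)))
```

Each template parameter becomes one child element, which mirrors the `author1`/`author2` naming visible in the transformed reference on the slide.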
MAXIM: Entity Matching in the Cloud


                      Stage 2: Blocking Stage

        Block generation
                  Each reference generates a set of candidate publications
                  Each candidate publication is inserted into every block
                  listed in the reference


        Example

                  Citations (input)

                  <citation>
                    <id>26334893</id>
                    <cat>Hashing</cat>
                    <cat>Join algorithms</cat>
                    <ref>
                      <type>journal</type>
                      <author>Hansjörg Zeller</author>
                      <author>Jim Gray</author>
                      <year>1990</year>
                      <pages>186-197</pages>
                      <title>An Adaptive Hash Join Algorithm for Multiuser Environments</title>
                      <journal>Proceedings of the 16th VLDB conference</journal>
                    </ref>
                  </citation>

                  <citation>
                    <id>26334893</id>
                    <cat>Search engine optimization</cat>
                    <cat>Internet search algorithms</cat>
                    <cat>Link analysis</cat>
                    <ref>
                      <type>journal</type>
                      <author>Taher Haveliwala</author>
                      <year>2003</year>
                      <pages>56-70</pages>
                      <title>The Second Eigenvalue of the Google Matrix</title>
                      <journal>Stanford University Technical Report</journal>
                    </ref>
                  </citation>

                  Flow

                  citation -> send query -> Search Engine (Full-Text Index) -> send result -> blocks

                  Blocks (output)

                  Hashing:          10.0.1.1.124, 10.0.1.11.23
                  Join algorithms:  10.0.1.1.124, 10.0.7.23.14

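Block generation as described above can be sketched in a few lines of Python. This is an illustration only: it assumes that blocks are keyed by the `<cat>` categories of a citation, as the example suggests, and it replaces the full-text index with a hypothetical `search` callable. The publication ids in the usage example are made up.

```python
from collections import defaultdict


def build_blocks(references, search):
    """Group candidate publication ids into blocks keyed by category.

    references: iterable of dicts with "title" (str) and "categories" (list of str).
    search: callable mapping a query string to a list of publication ids;
            a stand-in for the full-text index lookup (hypothetical).
    """
    blocks = defaultdict(lambda: {"references": [], "candidates": set()})
    for ref in references:
        candidates = search(ref["title"])  # one query per reference
        for category in ref["categories"]:
            block = blocks[category]
            block["references"].append(ref["title"])
            # Every candidate goes into every block listed in the reference.
            block["candidates"].update(candidates)
    return dict(blocks)


if __name__ == "__main__":
    fake_index = {"An Adaptive Hash Join Algorithm": ["pub-1", "pub-2"]}
    refs = [{"title": "An Adaptive Hash Join Algorithm",
             "categories": ["Hashing", "Join algorithms"]}]
    for name, block in build_blocks(refs, lambda q: fake_index.get(q, [])).items():
        print(name, sorted(block["candidates"]))
```

Because a candidate is copied into every block of its reference, the same publication id can appear in several blocks, as it does in the example output.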
MAXIM: Entity Matching in the Cloud


                     Stage 2: Blocking Stage

        Distributed Search in MAXIM

                  (a) Send HTTP request (query)
                  (b) HTTP response (partial result)
                  (c) Collect partial results

        [Diagram: five nodes (Node 1 to Node 5). Each node runs a Search Engine
         with a Full-text Index, a Hadoop Data Node and Task Tracker, and a
         ChuQL Engine. Node 1 sends (a) to Nodes 2 to 5, receives (b) from each
         of them, and performs (c).]

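The three steps (a) to (c) follow a scatter-gather pattern, which the following Python sketch illustrates. It is a simplified stand-in, not the MAXIM code: each node is modelled as a hypothetical callable instead of an HTTP endpoint, and merging by descending score is an assumption, since the slide only states that partial results are collected.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def distributed_search(query, node_searchers, top_k):
    """Send the query to every node, then merge the partial results.

    node_searchers: list of callables taking (query, top_k) and returning a
    list of (score, publication_id) pairs from that node's local index. In a
    real deployment each callable would wrap the HTTP request (a) and the
    HTTP response (b).
    """
    with ThreadPoolExecutor(max_workers=max(1, len(node_searchers))) as pool:
        partials = list(pool.map(lambda search: search(query, top_k),
                                 node_searchers))
    merged = [hit for partial in partials for hit in partial]
    # Step (c): keep the globally best top_k hits across all partial results.
    return heapq.nlargest(top_k, merged)


if __name__ == "__main__":
    def make_node(hits):
        # Hypothetical node: returns its best local hits, ignoring the query.
        return lambda query, top_k: sorted(hits, reverse=True)[:top_k]

    nodes = [make_node([(0.9, "a"), (0.2, "b")]), make_node([(0.7, "c")])]
    print(distributed_search("hash join", nodes, top_k=2))
```

Asking every node for its local top `top_k` is sufficient for a correct global top `top_k`, because no hit outside a node's local top `top_k` can rank higher globally.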
MAXIM: Entity Matching in the Cloud


                       Stage 3: Matching Stage
                  Applies user-defined similarity functions to candidate pairs
                  Each attribute can be evaluated by a specific similarity function
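A minimal Python sketch of per-attribute similarity functions is given here for illustration. The function names, the weighted-average combination, and the use of `difflib` are assumptions made for this sketch; the slides do not specify which similarity functions MAXIM ships with or how scores are combined.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Character-based similarity in [0, 1] (difflib ratio, case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_match(a, b):
    """1.0 if both values are equal, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def match_score(reference, publication, similarity_by_attribute, weights):
    """Weighted average of per-attribute similarities over shared attributes."""
    total, weight_sum = 0.0, 0.0
    for attribute, similarity in similarity_by_attribute.items():
        if attribute in reference and attribute in publication:
            weight = weights.get(attribute, 1.0)
            total += weight * similarity(reference[attribute],
                                         publication[attribute])
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


if __name__ == "__main__":
    functions = {"title": string_similarity, "year": exact_match}
    reference = {"title": "An Adaptive Hash Join Algorithm", "year": "1990"}
    publication = {"title": "An adaptive hash join algorithm", "year": "1990"}
    print(match_score(reference, publication, functions, {"title": 2.0}))
```

Passing the functions in a dictionary keeps the matching loop unchanged when a user swaps in a different similarity function for an attribute.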


        Number of candidate pairs

                  CP = \sum_{i=1}^{n} C_i \cdot R_i                                    (1)

                  n   - number of blocks B_1, ..., B_n
                  R_i - number of references in block B_i
                  C_i - number of candidate publications in block B_i
                  CP  - number of candidate pairs to verify

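Equation (1) can be checked with a short Python sketch. The block sizes in the usage example are hypothetical and serve only to show the arithmetic; `candidate_pairs` and `enumerate_pairs` are illustrative names.

```python
from itertools import product


def candidate_pairs(block_sizes):
    """CP = sum over all blocks of C_i * R_i.

    block_sizes: iterable of (C_i, R_i) tuples, i.e. the number of candidate
    publications and the number of references in block B_i.
    """
    return sum(c * r for c, r in block_sizes)


def enumerate_pairs(references, candidates):
    """All (reference, candidate) pairs of one block, to be verified."""
    return list(product(references, candidates))


if __name__ == "__main__":
    # Hypothetical blocks: 50 candidates x 3 references, 100 candidates x 2 references.
    print(candidate_pairs([(50, 3), (100, 2)]))
    print(len(enumerate_pairs(["r1", "r2"], ["p1", "p2", "p3"])))
```

Within a block every reference is paired with every candidate publication, which is why the per-block cost is the product of the two counts.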
Summary


                       Summary

                  Wikipedia provides many opportunities for research
                  The need to process semistructured data efficiently is increasing
                  Entity Matching is critical for data integration and data cleaning
                  Entity Matching is difficult to parallelize due to unbalanced data
                  partitions
                  MAXIM parallelizes EM by building blocks of similar records in a
                  classification fashion
                  MAXIM lets users define their own similarity and computation
                  functions without changing the algorithm



“Everything that can be invented has been invented.”
                          (Charles H. Duell, Commissioner, U.S. Office of Patents, 1899)




Experiments


                   Scaleup and Speedup

        [Figure: two plots over the number of nodes (5, 10, 20, 40).
         (a) Speedup for all stages. Y axis: Speedup = Base Time / New Time.
             Series: Ideal, INDEXING-2000, EXTRACTING-2000, BLOCKING, MATCHING.
         (b) Scaleup for preparation stage. Y axis: Scaleup = Base Time / New Time.
             Series: Ideal, EXTRACTING-2000, INDEXING-2000.]

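Both metrics are the ratio Base Time / New Time and differ only in the experiment setup. The Python sketch below shows the arithmetic; the timings in the usage example are hypothetical and are not measurements from the experiments.

```python
def time_ratio(base_time, new_time):
    """Base Time / New Time, used for both speedup and scaleup."""
    if new_time <= 0:
        raise ValueError("new_time must be positive")
    return base_time / new_time


def ideal_speedup(base_nodes, new_nodes):
    """Linear speedup: a fixed workload runs k times faster on k times the nodes."""
    return new_nodes / base_nodes


if __name__ == "__main__":
    # Hypothetical timings in seconds: 5 nodes (base) versus 40 nodes.
    print(time_ratio(base_time=120.0, new_time=30.0))
    print(ideal_speedup(base_nodes=5, new_nodes=40))
```

For speedup the workload stays fixed while nodes are added, so the ideal curve grows linearly with the node count. For scaleup the workload grows in proportion to the node count, so the ideal value stays at 1.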
Experiments


                  Query Performance

        [Plot: Avg. Query Response Time (ms) over Number of Nodes (5, 10, 20, 40).
         Series: RESULTCOUNT-50, RESULTCOUNT-100, RESULTCOUNT-150, RESULTCOUNT-200.]

        Figure: Query performance for different result set sizes and cluster sizes.

Experiments


                   Blocking Accuracy

        [Plot: Accuracy over Variance (0 to 1.0).
         Series: Ideal, WRONG-ORDER, MISPLACED-END, MISPLACED-ANY, MISSING.]

        Figure: Blocking accuracy for different typographical error classes.

Experiments


                  Number of Candidate Pairs

        [Plot: Number of candidate pairs over Variance (0.0 to 1.0).
         Series: RSCOUNT-50, RSCOUNT-100, RSCOUNT-150, RSCOUNT-200.]

        Figure: Number of candidate pair verifications in the matching stage.

 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 

Destacado

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by HubspotMarius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

Destacado (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Entity Matching for Semistructured Data in the Cloud

  • 1. Entity Matching for Semistructured Data in the Cloud Marcus Paradies ACM SAC 2012 - CC Track March 27, 2012 Marcus Paradies Entity Matching for Semistructured Data in the Cloud 1 / 19
  • 2. Outline: 1 Motivation; 2 ChuQL; 3 Entity Matching; 4 MAXIM: Entity Matching in the Cloud; 5 Summary
  • 3. Motivation. Enriching/Improving Wikipedia: references from the Wikipedia article "Hash join"
  • 4. Motivation. Enriching/Improving Wikipedia: lookup in the CiteSeer database
  • 5. Motivation. Enriching/Improving Wikipedia: lookup in Google
  • 7. Motivation. Wikipedia in a nutshell. Characteristics: 3.7 million articles (English Wikipedia database); dataset size of about 30 GB of XML (without history); 3.6 million references; references are categorized into books, journals, websites, etc. Challenges: articles in Wikipedia are incomplete; articles in Wikipedia are inaccurate; articles in Wikipedia are subjective
  • 9. Motivation. Problem Statement. Definition: Given two datasets of records, R and S, a set of attributes $a_1, \dots, a_n$, a set of similarity functions $sim_{a_1}, \dots, sim_{a_n}$, and a similarity threshold $\tau$, the entity matching task between R and S is defined as finding and combining all pairs of records from R and S where $\sum_{i=1}^{n} sim_{a_i}(R.a_i, S.a_i) \geq \tau$.
      Wikipedia data set:
        {{Cite book
         | last = Mumford
         | first = David
         | authorlink = David Mumford
         | title = The Red Book of Varieties and Schemes
         | publisher = [[Springer]]
         | location = Berlin
         | date = 1999
         | page = 198
         | doi = 10.1007/b62130
         | isbn = 354063293X
        }}
      CiteSeer data set:
        <record id="6627383">
          <author>David Mumford</author>
          <title>The red book of Varieties and Schemes</title>
          <publisher>Springer</publisher>
          <year>1999</year>
          <doi>10.1007/b62130</doi>
        </record>
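The definition above can be illustrated with a small Python sketch. It assumes that the per-attribute similarities are combined by summation and compared against τ; the similarity functions (`string_sim`, `exact_sim`), the attribute names, and the threshold value are illustrative assumptions, not taken from the slides.

```python
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (assumed choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_sim(a: str, b: str) -> float:
    """1.0 for equal non-empty values, otherwise 0.0 (assumed choice)."""
    a, b = a.strip(), b.strip()
    return 1.0 if a and a == b else 0.0


# One similarity function per attribute (hypothetical attribute names).
SIMS = {"title": string_sim, "author": string_sim, "year": exact_sim}


def combined_similarity(r: dict, s: dict, sims=SIMS) -> float:
    """Sum of per-attribute similarities between records r and s."""
    return sum(sim(r.get(attr, ""), s.get(attr, "")) for attr, sim in sims.items())


def match(R, S, tau, sims=SIMS):
    """Naive matching: compare every record of R with every record of S."""
    return [(r, s) for r in R for s in S if combined_similarity(r, s, sims) >= tau]


wikipedia = [
    {"title": "The Red Book of Varieties and Schemes",
     "author": "David Mumford", "year": "1999"},
]
citeseer = [
    {"title": "The red book of Varieties and Schemes",
     "author": "David Mumford", "year": "1999"},
    {"title": "Selecting Tense, Aspect, and Connecting Words In Language Generation",
     "author": "Bonnie Dorr"},
]

matches = match(wikipedia, citeseer, tau=2.5)
```

With three attributes the combined score lies between 0 and 3, so a threshold of 2.5 only accepts pairs that agree on almost every attribute. The nested loop in `match` is the quadratic comparison that blocking is meant to avoid.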
  • 10. ChuQL
  • 11. ChuQL. ChuQL by example: Wordcount in ChuQL
      mapreduce {
        input  { fn:collection("hdfs://wiki/") }
        rr     { for $rev in $hxml:in//revision
                 return {"key": fn:data($rev//username | $rev//ip),
                         "val": $rev//title} }
        map    { $hxml:in }
        reduce { {"key": $hxml:in=>"key", "val": fn:count($hxml:in=>"val")} }
        rw     { <author name="{$hxml:in=>"key"}" count="{$hxml:in=>"val"}"/> }
        output { fn:put("hdfs://outputdir/") }
      }
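For readers unfamiliar with ChuQL, the following Python sketch mimics the phases of the job above (record reader, identity map, grouping, reduce, record writer) in a single process. It is not ChuQL and not Hadoop; the sample XML, its element layout, and all function names are assumptions made for illustration.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# Tiny made-up sample in the spirit of a Wikipedia dump.
WIKI_XML = """<mediawiki>
  <page><title>Hash join</title>
    <revision><contributor><username>alice</username></contributor></revision>
    <revision><contributor><ip>10.0.0.1</ip></contributor></revision>
  </page>
  <page><title>Sort-merge join</title>
    <revision><contributor><username>alice</username></contributor></revision>
  </page>
</mediawiki>"""


def record_reader(xml_text):
    """rr: emit one (author, title) pair per revision."""
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        title = page.findtext("title")
        for rev in page.iter("revision"):
            author = rev.findtext(".//username") or rev.findtext(".//ip")
            yield author, title


def map_phase(pairs):
    """map: identity, as in the ChuQL example."""
    yield from pairs


def shuffle(pairs):
    """Group values by key, which the MapReduce framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """reduce: count the values collected for each author."""
    return {key: len(values) for key, values in groups.items()}


def record_writer(counts):
    """rw: serialize each result as an <author/> element."""
    return [f'<author name="{k}" count="{c}"/>' for k, c in sorted(counts.items())]


output = record_writer(reduce_phase(shuffle(map_phase(record_reader(WIKI_XML)))))
```

Here `output` holds one `<author/>` element per contributor, with the number of revisions that contributor made.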
  • 12. Entity Matching
  • 14. Entity Matching. What is Entity Matching? Challenges: Entity Matching has quadratic runtime behavior; Entity Matching has high CPU and memory demands; the definition of "what is similar" is domain-dependent
  • 16. Entity Matching. Entity Matching Architecture. [Diagram: Data Source S1 and Data Source S2; Blocking; blocks b1, b2, b3, ..., bn; Matching; Match Result R] How can we improve the runtime of an EM task?
  • 18. Entity Matching. Entity Matching Architecture. [Same diagram, with the Blocking step labeled "Distributed Blocking" and the Matching step labeled "Parallel Matching"]
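A minimal Python sketch of why the blocking step in this architecture reduces work: records are only compared within a block, so the number of candidate pairs drops below |R| × |S|. The blocking key used here (first letter of the author's last name) is a deliberately simple stand-in and is not the search-based blocking that MAXIM uses; the records are toy data.

```python
from collections import defaultdict
from itertools import product


def blocking_key(record: dict) -> str:
    """Hypothetical blocking key: first letter of the author's last name."""
    return record["author"].split()[-1][0].lower()


def build_blocks(R, S, key=blocking_key):
    """Assign every record of R and S to the block named by its key."""
    blocks = defaultdict(lambda: ([], []))
    for r in R:
        blocks[key(r)][0].append(r)
    for s in S:
        blocks[key(s)][1].append(s)
    return blocks


def candidate_pairs(blocks):
    """Only records that share a block are paired for matching."""
    for rs, ss in blocks.values():
        yield from product(rs, ss)


R = [{"author": "David Mumford"}, {"author": "Jim Gray"}, {"author": "Bonnie Dorr"}]
S = [{"author": "David Mumford"}, {"author": "J. Gray"},
     {"author": "Taher Haveliwala"}, {"author": "Hansjörg Zeller"}]

naive_pairs = len(R) * len(S)
blocked_pairs = len(list(candidate_pairs(build_blocks(R, S))))
```

Because the blocks are independent of each other, the pairs of each block can be verified by a separate worker, which is the idea behind the "Parallel Matching" label.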
  • 19. MAXIM: Entity Matching in the Cloud
  • 21. MAXIM: Entity Matching in the Cloud. Requirements and Approach. Requirements: efficient processing of semistructured data; scalability to large datasets; independence from specific similarity functions; ability to easily add new similarity functions. Main Idea: use MapReduce and ChuQL to process semistructured data; use search-based blocking to generate candidate pairs; apply similarity functions to candidate pairs within a block
  • 22. MAXIM: Entity Matching in the Cloud. Architecture. [Diagram: Node 1, Node 2, ..., Node N, each with a search engine and full-text index, a Hadoop data node and task tracker, and a ChuQL engine; HDFS] Hadoop cluster with up to 40 nodes; each node runs a search engine and an attached full-text index; each node runs an in-memory XQuery processor; semistructured data is partitioned and placed on HDFS
  • 23. MAXIM: Entity Matching in the Cloud. Processing Stages. Three Stages: Preparation Stage; Blocking Stage; Matching Stage
  • 24. MAXIM: Entity Matching in the Cloud. Processing Stages. Stage 1: Preparation Stage: extracts references from Wikipedia; reads and transforms records from CiteSeerX; sends CiteSeerX data to the local full-text index
  • 25. MAXIM: Entity Matching in the Cloud. Processing Stages. Stage 2: Blocking Stage: reads extracted references from HDFS; probes the full-text index to retrieve candidate publications; assigns candidate publications to block(s)
  • 26. MAXIM: Entity Matching in the Cloud. Processing Stages. Stage 3: Matching Stage: reads blocks from HDFS; generates candidate pairs and applies similarity functions; stores matching pairs and their similarity. [Diagram: pipeline between the search engines and HDFS; Preparation Stage jobs: Extract Wikipedia references, Index CiteSeerX records; Blocking Stage job: Generate Semantic Block; Matching Stage job: Record pair generation]
  • 32. MAXIM: Entity Matching in the Cloud. Stage 1: Preparation Stage. [Diagram labels: HDFS, Lucene Index]
      Extracting References. Extraction:
        {{cite journal
         | author1 = Hansjörg Zeller
         | author2 = Jim Gray
         | title = An Adaptive Hash Join Algorithm for Multi-User Environments
         | journal = Proceedings of the 16th VLDB conference
         | year = 1990
         | pages = 186–197
        }}
      Transformation:
        <reference type="journal">
          <author1>Hansjörg Zeller</author1>
          <author2>Jim Gray</author2>
          <title>An Adaptive Hash Join Algorithm for Multi-User Environments</title>
          <journal>Proceedings of the 16th VLDB conference</journal>
          <year>1990</year>
          <pages>186–197</pages>
        </reference>
      Indexing Publications. Read and Transformation:
        <doc>
          <field name="id">10.1.1.49.2550</field>
          <field name="title">Selecting Tense, Aspect, and Connecting Words In Language Generation</field>
          <field name="author">Bonnie Dorr</field>
          <field name="description">Generating language ...</field>
        </doc>
      Indexing: Lucene Index
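The extraction and transformation step shown above can be sketched in Python as follows. This is only an illustration of the idea, assuming a flat template without nested templates or wiki links that contain a `|`; the function name `cite_to_xml` is hypothetical, and MAXIM itself performs this step with ChuQL.

```python
import re
import xml.etree.ElementTree as ET

CITE = """{{cite journal
 | author1 = Hansjörg Zeller
 | author2 = Jim Gray
 | title = An Adaptive Hash Join Algorithm for Multi-User Environments
 | journal = Proceedings of the 16th VLDB conference
 | year = 1990
 | pages = 186–197
}}"""


def cite_to_xml(template: str) -> ET.Element:
    """Turn a flat {{cite <type> | name = value ...}} template into a <reference> element."""
    m = re.match(r"\{\{\s*cite\s+(\w+)(.*)\}\}\s*$", template.strip(),
                 re.IGNORECASE | re.DOTALL)
    if m is None:
        raise ValueError("not a cite template")
    ref = ET.Element("reference", {"type": m.group(1).lower()})
    # Each "| name = value" part becomes one child element.
    for part in m.group(2).split("|"):
        if "=" not in part:
            continue
        name, value = part.split("=", 1)
        child = ET.SubElement(ref, name.strip())
        child.text = value.strip()
    return ref


reference = cite_to_xml(CITE)
xml_text = ET.tostring(reference, encoding="unicode")
```

`xml_text` then contains a single `<reference type="journal">` element with one child per template parameter, in the order the parameters appear in the template.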
  • 34. MAXIM: Entity Matching in the Cloud. Stage 2: Blocking Stage. Block generation: each reference generates a set of candidate publications; each candidate publication is inserted into all blocks that are listed in the reference.
      Example reference:
        <citation>
          <id>26334893</id>
          <cat>Hashing</cat>
          <cat>Join algorithms</cat>
          <ref>
            <type>journal</type>
            <author>Hansjörg Zeller</author>
            <author>Jim Gray</author>
            <year>1990</year>
            <pages>186-197</pages>
            <title>An Adaptive Hash Join Algorithm for Multiuser Environments</title>
            <journal>Proceedings of the 16th VLDB conference</journal>
          </ref>
        </citation>
      [Diagram: references are sent as queries to the search engine and its full-text index, which sends back results; further references shown include "The Second Eigenvalue of the Google Matrix" (Taher Haveliwala, 2003, pages 56-70, Stanford University Technical Report; categories: Search engine optimization, Internet search algorithms, Link analysis); candidate publication ids shown: 10.0.1.1.124, 10.0.1.11.23, 10.0.7.23.14; block labels shown: Hashing, Join algorithms]
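A small Python sketch of search-based block generation as described above. The full-text index is replaced by a toy in-memory inverted index over titles; the publication ids, the second publication title, the `min_hits` cutoff, and all function names are assumptions for illustration and do not come from the slides.

```python
import re
from collections import defaultdict

# Toy publication collection (made-up ids; "pub-2" is a made-up title).
PUBLICATIONS = {
    "pub-1": "An Adaptive Hash Join Algorithm for Multiuser Environments",
    "pub-2": "Hash Join Processing for Large Tables",
    "pub-3": "The Second Eigenvalue of the Google Matrix",
}


def tokenize(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def build_index(publications: dict) -> dict:
    """Inverted index: token -> ids of publications whose title contains it."""
    index = defaultdict(set)
    for pub_id, title in publications.items():
        for token in tokenize(title):
            index[token].add(pub_id)
    return index


def search(index: dict, query: str, min_hits: int = 2) -> set:
    """Return publications sharing at least min_hits tokens with the query."""
    hits = defaultdict(int)
    for token in tokenize(query):
        for pub_id in index.get(token, ()):
            hits[pub_id] += 1
    return {pub_id for pub_id, n in hits.items() if n >= min_hits}


def assign_to_blocks(references, index):
    """Each category of a reference names a block that receives its candidates."""
    blocks = defaultdict(lambda: {"references": [], "candidates": set()})
    for ref in references:
        candidates = search(index, ref["title"])
        for category in ref["categories"]:
            blocks[category]["references"].append(ref["id"])
            blocks[category]["candidates"] |= candidates
    return blocks


references = [
    {"id": "ref-1", "categories": ["Hashing", "Join algorithms"],
     "title": "An Adaptive Hash Join Algorithm for Multiuser Environments"},
    {"id": "ref-2", "categories": ["Link analysis"],
     "title": "The Second Eigenvalue of the Google Matrix"},
]

blocks = assign_to_blocks(references, build_index(PUBLICATIONS))
```

The candidates of `ref-1` end up in both the "Hashing" and the "Join algorithms" block, because the reference lists both categories.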
  • 35. MAXIM: Entity Matching in the Cloud. Stage 2: Blocking Stage. Distributed Search in MAXIM: (a) send HTTP request (query); (b) HTTP response (partial result); (c) collect partial results. [Diagram: Search Node 1 sends requests (a) to Search Nodes 2 to 5, receives their responses (b), and collects the partial results (c); each node runs a search engine with a full-text index, a Hadoop data node and task tracker, and a ChuQL engine]
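The request, response, and collect steps (a) to (c) follow a scatter-gather pattern, sketched below in Python. The HTTP calls are replaced by a local function so that the example stays self-contained; the shard contents, node names, and the token-overlap scoring are assumptions for illustration only.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Made-up index shards, one per search node: publication id -> title.
SHARDS = {
    "node2": {"p1": "adaptive hash join algorithm", "p2": "google matrix eigenvalue"},
    "node3": {"p3": "hash join"},
    "node4": {"p4": "language generation"},
}


def score(query: str, title: str) -> int:
    """Toy relevance score: number of shared tokens."""
    return len(set(query.lower().split()) & set(title.lower().split()))


def query_node(node: str, query: str, limit: int):
    """Stand-in for steps (a) and (b): ask one node for its best local hits."""
    scored = [(score(query, title), pub_id) for pub_id, title in SHARDS[node].items()]
    scored = [hit for hit in scored if hit[0] > 0]
    return heapq.nlargest(limit, scored)


def distributed_search(query: str, limit: int = 2):
    """Step (c): query all nodes in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda node: query_node(node, query, limit), SHARDS)
        merged = [hit for partial in partials for hit in partial]
    return [pub_id for _, pub_id in heapq.nlargest(limit, merged)]


result = distributed_search("adaptive hash join")
```

Each node only returns its own top hits, so the collecting node merges at most `limit` entries per node instead of scanning every shard itself.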
  • 37. MAXIM: Entity Matching in the Cloud. Stage 3: Matching Stage. Applies user-defined similarity functions to candidate pairs; each attribute can be evaluated by a specific similarity function. Number of candidate pairs: $CP = \sum_{i=1}^{n} C_i \cdot R_i$ (1), where $n$ is the number of blocks $B_1, \dots, B_n$; $R_i$ is the number of references in block $B_i$; $C_i$ is the number of candidate publications in block $B_i$; $CP$ is the number of candidate pairs to verify
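Equation (1) can be checked with a few lines of Python. The block contents below are made-up toy values; the sketch compares the closed-form count with an explicit enumeration of the pairs.

```python
from itertools import product


def candidate_pair_count(blocks) -> int:
    """CP = sum over all blocks of C_i * R_i, as in equation (1)."""
    return sum(len(b["candidates"]) * len(b["references"]) for b in blocks)


def enumerate_pairs(blocks):
    """Explicitly pair every reference with every candidate of the same block."""
    for b in blocks:
        yield from product(b["references"], b["candidates"])


# Toy blocks: (R_1 = 2, C_1 = 3), (R_2 = 4, C_2 = 1), (R_3 = 5, C_3 = 0).
blocks = [
    {"references": ["r1", "r2"], "candidates": ["c1", "c2", "c3"]},
    {"references": ["r3", "r4", "r5", "r6"], "candidates": ["c4"]},
    {"references": ["r7", "r8", "r9", "r10", "r11"], "candidates": []},
]

cp = candidate_pair_count(blocks)
enumerated = len(list(enumerate_pairs(blocks)))
```

A block without candidate publications contributes no pairs, however many references it holds.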
  • 38. Summary: Wikipedia provides many opportunities for research; the need for efficiently processing semistructured data is increasing; Entity Matching is critical for data integration and data cleaning; Entity Matching is difficult to parallelize due to unbalanced data partitions; MAXIM parallelizes EM by building blocks of similar records in a classification fashion; MAXIM allows users to define their own similarity functions and computation functions without changing the algorithm
  • 39. “Everything that can be invented has been invented.” (Charles H. Duell, Commissioner, U.S. Office of Patents, 1899)
  • 40. Experiments. Scaleup and Speedup. [Figure: (a) Speedup for all stages; (b) Scaleup for preparation stage. x-axis: Number of nodes (5, 10, 20, 40); y-axes: Speedup = Base Time / New Time and Scaleup = Base Time / New Time; series: Ideal, EXTRACTING-2000, INDEXING-2000, BLOCKING, MATCHING]
  • 41. Experiments. Query Performance. [Figure: Query performance for different result set sizes and cluster sizes. x-axis: Number of Nodes (5, 10, 20, 40); y-axis: Avg. Query Response Time (ms); series: RESULTCOUNT-50, RESULTCOUNT-100, RESULTCOUNT-150, RESULTCOUNT-200]
  • 42. Experiments. Blocking Accuracy. [Figure: Blocking accuracy for different typographical error classes. x-axis: Variance (0 to 1.0); y-axis: Accuracy; series: Ideal, WRONG-ORDER, MISPLACED-END, MISPLACED-ANY, MISSING]
  • 43. Experiments. Number of Candidate Pairs. [Figure: Number of candidate pair verifications in the matching stage. x-axis: Variance (0.0 to 1.0); y-axis: Number of candidate pairs; series: RSCOUNT-50, RSCOUNT-100, RSCOUNT-150, RSCOUNT-200]