Entity Matching for Semistructured Data in the Cloud
1. Entity Matching for Semistructured Data
in the Cloud
Marcus Paradies
ACM SAC 2012 - CC Track
March 27, 2012
Marcus Paradies Entity Matching for Semistructured Data in the Cloud
1 / 19
2. Outline
1 Motivation
2 ChuQL
3 Entity Matching
4 MAXIM: Entity Matching in the Cloud
5 Summary
3. Motivation
Enriching/Improving Wikipedia
References from Wikipedia article Hash join
Lookup in the CiteSeer database
Lookup in Google
7. Motivation
Wikipedia in a nutshell
Characteristics
3.7 million articles (English Wikipedia database)
Dataset size about 30 GB of XML (without history)
3.6 million references
References are categorized into books, journals, websites, etc.
Challenges
Articles in Wikipedia are incomplete
Articles in Wikipedia are inaccurate
Articles in Wikipedia are subjective
8. Motivation
Problem Statement
Definition
Given two datasets of records, R and S, a set of attributes a_1, ..., a_n, a set of similarity functions sim_{a_1}, ..., sim_{a_n}, and a similarity threshold τ, the entity matching task between R and S is defined as finding and combining all pairs of records from R and S where
$$\sum_{i=1}^{n} \mathrm{sim}_{a_i}(R.a_i, S.a_i) \geq \tau$$
Wikipedia data set
{{Cite book
| last = Mumford
| first = David
| authorlink = David Mumford
| title = The Red Book of Varieties and Schemes
| publisher = [[Springer]]
| location = Berlin
| date = 1999
| page = 198
| doi = 10.1007/b62130
| isbn = 354063293X
}}
CiteSeer data set
<record id="6627383">
<author>David Mumford</author>
<title>The red book of Varieties and Schemes</title>
<publisher>Springer</publisher>
<year>1999</year>
<doi>10.1007/b62130</doi>
</record>
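The problem statement can be illustrated with a minimal Python sketch. It assumes the per-attribute similarities are aggregated by a sum, and the concrete similarity functions (`string_sim`, `exact_sim`), the attribute set, and the threshold value are illustrative choices, not the ones used in the talk.

```python
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_sim(a: str, b: str) -> float:
    """1.0 if both values are non-empty and equal, else 0.0."""
    return 1.0 if a and a.strip() == b.strip() else 0.0


# One similarity function per attribute (hypothetical configuration).
SIMS = {"author": string_sim, "title": string_sim, "year": exact_sim}


def score(r: dict, s: dict) -> float:
    # Sum of the per-attribute similarities, as assumed in the lead-in.
    return sum(sim(r.get(a, ""), s.get(a, "")) for a, sim in SIMS.items())


def match(R, S, tau):
    # Naive nested loop over all |R| * |S| pairs.
    result = []
    for r in R:
        for s in S:
            sc = score(r, s)
            if sc >= tau:
                result.append((r, s, sc))
    return result


wikipedia = [{"author": "David Mumford",
              "title": "The Red Book of Varieties and Schemes",
              "year": "1999"}]
citeseer = [
    {"author": "David Mumford",
     "title": "The red book of Varieties and Schemes",
     "year": "1999"},
    {"author": "Bonnie Dorr",
     "title": "Selecting Tense, Aspect, and Connecting Words In Language Generation"},
]
pairs = match(wikipedia, citeseer, tau=2.5)
```

With these toy records only the two Mumford entries pass the threshold. The nested loop compares every record of R with every record of S, which is why the approach does not scale without blocking.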
14. Entity Matching
What is Entity Matching?
Challenges
Entity Matching has quadratic runtime behavior
Entity Matching has high CPU and memory demands
The definition of “what is similar” is domain-dependent
15. Entity Matching
Entity Matching Architecture
[Figure: Data Source S1 and Data Source S2 feed a Blocking step that produces blocks b1, b2, b3, ..., bn; a Matching step processes the blocks and produces the Match Result R.]
How can we improve the runtime of an EM task?
Distributed Blocking and Parallel Matching
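As a rough sketch of why blocking and parallel matching help, the following Python example groups records by a key and matches each block independently. The blocking key (publication year), the toy matcher, and the thread pool are assumptions made for illustration; the thread pool only stands in for distributing blocks across cluster nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def blocking_key(record):
    # Hypothetical blocking key: the publication year.
    return record["year"]


def build_blocks(R, S):
    # Each block holds the R-records and S-records that share a key.
    blocks = defaultdict(lambda: ([], []))
    for r in R:
        blocks[blocking_key(r)][0].append(r)
    for s in S:
        blocks[blocking_key(s)][1].append(s)
    return blocks


def match_block(block):
    # Toy matcher: titles are equal when case is ignored.
    rs, ss = block
    return [(r["id"], s["id"]) for r in rs for s in ss
            if r["title"].lower() == s["title"].lower()]


def run(R, S, workers=4):
    blocks = build_blocks(R, S)
    # Blocks are independent, so they can be matched concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(match_block, blocks.values()))
    matches = sorted(pair for part in parts for pair in part)
    compared = sum(len(rs) * len(ss) for rs, ss in blocks.values())
    return matches, compared


R = [
    {"id": "r1", "year": "1990",
     "title": "An Adaptive Hash Join Algorithm for Multi-User Environments"},
    {"id": "r2", "year": "1999",
     "title": "The Red Book of Varieties and Schemes"},
]
S = [
    {"id": "s1", "year": "1999",
     "title": "The red book of Varieties and Schemes"},
    {"id": "s2", "year": "1990",
     "title": "An Adaptive Hash Join Algorithm for Multi-User Environments"},
    {"id": "s3", "year": "2003",
     "title": "The Second Eigenvalue of the Google Matrix"},
]
matches, compared = run(R, S)
naive = len(R) * len(S)
```

In this toy input the blocked run performs 2 comparisons instead of the 6 a full cross product would need, and each block can be handed to a different worker.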
19. MAXIM: Entity Matching
in the Cloud
21. MAXIM: Entity Matching in the Cloud
Requirements and Approach
Requirements
Efficient processing of semistructured data
Scalability to large datasets
Independence from specific similarity functions
Ability to easily add new similarity functions
Main Idea
Use MapReduce and ChuQL to process semistructured data
Use search-based blocking to generate candidate pairs
Apply similarity functions to candidate pairs within a block
22. MAXIM: Entity Matching in the Cloud
Architecture
[Figure: Search Node 1, Search Node 2, ..., Search Node N. Each node hosts a search engine with a full-text index, a Hadoop data node and task tracker, and a ChuQL engine; all nodes sit on top of HDFS.]
Hadoop cluster with up to 40 nodes
Each node runs a search engine and an attached full-text index
Each node runs an in-memory XQuery processor
Semistructured data is partitioned and placed on HDFS
23. MAXIM: Entity Matching in the Cloud
Processing Stages
Three Stages
Preparation Stage
Blocking Stage
Matching Stage
Stage 1: Preparation Stage
Extracts references from Wikipedia
Reads and transforms records from CiteSeerX
Sends CiteSeerX data to local full-text index
Stage 2: Blocking Stage
Reads extracted references from HDFS
Probes full-text index to retrieve candidate publications
Assigns candidate publications to one or more blocks
[Figure: processing pipeline between the search engines and HDFS. Preparation Stage: extract Wikipedia references, index CiteSeerX records. Blocking Stage: generate semantic blocks. Matching Stage: record pair generation.]
Stage 3: Matching Stage
Reads blocks from HDFS
Generates candidate pairs and applies similarity functions
Stores matching pairs and their similarity
29. MAXIM: Entity Matching in the Cloud
Stage 1: Preparation Stage
Extracting References Indexing Publications
Extraction
{{cite journal
| author1 = Hansjörg Zeller
| author2 = Jim Gray
| title = An Adaptive Hash Join Algorithm for Multi-User Environments
| journal = Proceedings of the 16th VLDB conference
| year = 1990
| pages = 186–197
}}
Transformation
<reference type="journal">
<author1>Hansjörg Zeller</author1>
<author2>Jim Gray</author2>
<title>An Adaptive Hash Join Algorithm for Multi-User
Environments</title>
<journal>Proceedings of the 16th VLDB conference</journal>
<year>1990</year>
<pages>186–197</pages>
</reference>
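The extraction and transformation step shown above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names `parse_cite_template` and `to_reference_xml` are hypothetical, and splitting on `|` ignores nested templates and piped wiki links such as `[[a|b]]`. The talk itself uses ChuQL on Hadoop for this step.

```python
import xml.etree.ElementTree as ET


def parse_cite_template(text):
    """Split a {{cite ...}} template into its kind and its key/value fields."""
    body = text.strip()
    if not (body.startswith("{{") and body.endswith("}}")):
        raise ValueError("not a template")
    parts = [p.strip() for p in body[2:-2].split("|")]
    name = parts[0].split(None, 1)          # e.g. ["cite", "journal"]
    kind = name[1].lower() if len(name) > 1 else ""
    fields = {}
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.strip()] = value.strip()
    return kind, fields


def to_reference_xml(kind, fields):
    """Build a <reference> element; keys are assumed to be valid XML names."""
    ref = ET.Element("reference", {"type": kind})
    for key, value in fields.items():
        ET.SubElement(ref, key).text = value
    return ET.tostring(ref, encoding="unicode")


template = """{{cite journal
| author1 = Hansjörg Zeller
| author2 = Jim Gray
| title = An Adaptive Hash Join Algorithm for Multi-User Environments
| journal = Proceedings of the 16th VLDB conference
| year = 1990
| pages = 186–197
}}"""

kind, fields = parse_cite_template(template)
xml_text = to_reference_xml(kind, fields)
```

The resulting string carries the same fields as the `<reference>` element on the slide, serialized on a single line.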
32. MAXIM: Entity Matching in the Cloud
Stage 1: Preparation Stage
Indexing Publications
HDFS
Read and Transformation
<doc>
<field name="id">10.1.1.49.2550</field>
<field name="title">Selecting Tense, Aspect, and Connecting Words In Language Generation</field>
<field name="author">Bonnie Dorr</field>
<field name="description">Generating language ...</field>
</doc>
Indexing
Lucene Index
34. MAXIM: Entity Matching in the Cloud
Stage 2: Blocking Stage
Block generation
Each reference generates a set of candidate publications
Each candidate publication is inserted into every block listed in the reference
Example
<citation>
<id>26334893</id>
<cat>Hashing</cat>
<cat>Join algorithms</cat>
<ref>
<type>journal</type>
<author>Hansjörg Zeller</author>
<author>Jim Gray</author>
<year>1990</year>
<pages>186-197</pages>
<title>An Adaptive Hash Join Algorithm for Multiuser Environments</title>
<journal>Proceedings of the 16th VLDB conference</journal>
</ref>
</citation>
<citation>
<id>26334893</id>
<cat>Search engine optimization</cat>
<cat>Internet search algorithms</cat>
<cat>Link analysis</cat>
<ref>
<type>journal</type>
<author>Taher Haveliwala</author>
<year>2003</year>
<pages>56-70</pages>
<title>The Second Eigenvalue of the Google Matrix</title>
<journal>Stanford University Technical Report</journal>
</ref>
</citation>
[Figure: each citation is sent as a query to the search engine and its full-text index; the returned candidate publications (for example 10.0.1.1.124, 10.0.1.11.23, 10.0.7.23.14) are placed into the blocks named by the citation's categories, here "Hashing" and "Join algorithms".]
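The block generation described above can be sketched with a toy inverted index in Python. Everything here is an assumption for illustration: `ToyIndex` is a stand-in for the real full-text index, the publication ids `pub-1` to `pub-3` and the title of `pub-3` are made up, and ranking by number of shared tokens is a simplification of a real search engine's scoring.

```python
from collections import defaultdict


def tokenize(text):
    # Lowercase, replace punctuation by spaces, drop very short tokens.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return [t for t in cleaned.split() if len(t) > 2]


class ToyIndex:
    """Minimal in-memory inverted index (stand-in for a full-text index)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for tok in tokenize(text):
            self.postings[tok].add(doc_id)

    def search(self, query, top_k=50):
        hits = defaultdict(int)
        for tok in set(tokenize(query)):
            for doc_id in self.postings.get(tok, ()):
                hits[doc_id] += 1
        ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
        return [doc_id for doc_id, _ in ranked[:top_k]]


def build_blocks(citations, index, top_k=50):
    # Every category of a citation names a block; the citation's candidate
    # publications are inserted into each of those blocks.
    blocks = defaultdict(lambda: {"references": [], "candidates": set()})
    for cit in citations:
        candidates = index.search(cit["title"], top_k)
        for cat in cit["cats"]:
            blocks[cat]["references"].append(cit["title"])
            blocks[cat]["candidates"].update(candidates)
    return blocks


index = ToyIndex()
index.add("pub-1", "An Adaptive Hash Join Algorithm for Multi-User Environments")
index.add("pub-2", "The Second Eigenvalue of the Google Matrix")
index.add("pub-3", "Join Processing in Relational Databases")  # hypothetical title

citations = [{
    "title": "An Adaptive Hash Join Algorithm for Multiuser Environments",
    "cats": ["Hashing", "Join algorithms"],
}]
blocks = build_blocks(citations, index)
```

The `top_k` parameter bounds how many candidates a single reference can contribute to a block, which plays the role of the result set size in a search-based blocking scheme.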
35. MAXIM: Entity Matching in the Cloud
Stage 2: Blocking Stage
Distributed Search in MAXIM
(a) Send HTTP request (query)
(b) HTTP response (partial result)
(c) Collect partial results
[Figure: Search Node 1 sends the query (a) to Search Nodes 2 to 5, receives their partial results (b), and collects them (c). Every node runs a search engine with a full-text index, a Hadoop data node and task tracker, and a ChuQL engine.]
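The scatter-gather pattern behind steps (a) to (c) can be sketched as follows. The shards, the term-overlap scoring, and the function names are assumptions for illustration; `search_shard` is a local stand-in for the HTTP request and response exchanged with a remote search node.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, query, top_k):
    # Stand-in for steps (a) and (b): query one node, get its partial result.
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in shard.items():
        score = len(terms & set(text.lower().split()))
        if score:
            scored.append((score, doc_id))
    return heapq.nlargest(top_k, scored)


def distributed_search(shards, query, top_k=3):
    # Scatter the query to all shards concurrently.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, query, top_k), shards))
    # Step (c): collect the partial results and keep the global top-k.
    merged = heapq.nlargest(top_k, (hit for part in partials for hit in part))
    return [doc_id for _, doc_id in merged]


shards = [
    {"d1": "adaptive hash join algorithm"},
    {"d2": "google matrix eigenvalue", "d3": "hash join"},
    {"d4": "link analysis"},
]
result = distributed_search(shards, "adaptive hash join")
```

Because each shard returns at most `top_k` hits, the coordinating node merges a bounded amount of data regardless of how large the individual indexes are.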
37. MAXIM: Entity Matching in the Cloud
Stage 3: Matching Stage
Applies user-defined similarity functions to candidate pairs
Each attribute can be evaluated by a specific similarity function
Number of candidate pairs
$$CP = \sum_{i=1}^{n} C_i \cdot R_i \qquad (1)$$
n: number of blocks B_1, ..., B_n
R_i: number of references in block B_i
C_i: number of candidate publications in block B_i
CP: number of candidate pairs to verify
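Equation (1) is straightforward to evaluate. The short Python sketch below does so for two blocks; the block names are taken from the earlier example, while the counts are hypothetical numbers chosen only to show the arithmetic.

```python
def candidate_pairs(blocks):
    """CP = sum over all blocks of C_i * R_i.

    `blocks` maps a block name to (C_i, R_i): the number of candidate
    publications and the number of references in that block.
    """
    return sum(c * r for c, r in blocks.values())


# Hypothetical block sizes, for illustration only.
blocks = {
    "Hashing": (50, 3),          # 50 candidates, 3 references -> 150 pairs
    "Join algorithms": (50, 2),  # 50 candidates, 2 references -> 100 pairs
}
cp = candidate_pairs(blocks)
```

A publication that lands in several blocks is counted once per block, so CP is the number of verifications performed rather than the number of distinct pairs.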
38. Summary
Summary
Wikipedia provides many opportunities for research
Need for efficiently processing semistructured data is increasing
Entity Matching is critical for data integration and data cleaning
Entity Matching is difficult to parallelize due to unbalanced data partitions
MAXIM parallelizes EM by building blocks of similar records in a classification fashion
MAXIM allows users to define their own similarity functions and computation functions without changing the algorithm
39. “Everything that can be invented has been invented.”
(Charles H. Duell, Commissioner, U.S. Office of Patents, 1899)
40. Experiments
Scaleup and Speedup
[Figure (a): Speedup for all stages. Series: Ideal, INDEXING-2000, EXTRACTING-2000, BLOCKING, MATCHING. x-axis: number of nodes (5, 10, 20, 40); y-axis: Speedup = Base Time / New Time.]
[Figure (b): Scaleup for preparation stage. Series: Ideal, EXTRACTING-2000, INDEXING-2000. x-axis: number of nodes (5, 10, 20, 40); y-axis: Scaleup = Base Time / New Time.]
41. Experiments
Query Performance
[Plot: average query response time (ms) against number of nodes (5, 10, 20, 40). Series: RESULTCOUNT-50, RESULTCOUNT-100, RESULTCOUNT-150, RESULTCOUNT-200.]
Figure: Query Performance for different result set sizes and cluster sizes.
42. Experiments
Blocking Accuracy
[Plot: accuracy against variance (0 to 1.0). Series: Ideal, WRONG-ORDER, MISPLACED-END, MISPLACED-ANY, MISSING.]
Figure: Blocking accuracy for different typographical error classes.
43. Experiments
Number of Candidate Pairs
[Plot: number of candidate pairs against variance (0.0 to 1.0). Series: RSCOUNT-50, RSCOUNT-100, RSCOUNT-150, RSCOUNT-200.]
Figure: Number of candidate pair verifications in the matching stage.