Integrating Product Data from Websites offering Microdata Markup
School of Business Informatics and Mathematics
Petar Petrovski, Volha Bryl, Christian Bizer
Data and Web Science Research Group
University of Mannheim, Germany
Outline
1. HTML-embedded Data on the Web
2. The Data Integration Pipeline
   1. Microdata extraction
   2. Classification
   3. Feature extraction
   4. Identity resolution
   5. Data Fusion
3. Conclusions
2Integrating Product Data from Microdata Markup. Petar Petrovski, Volha Bryl, Chris Bizer
HTML-embedded Data
More and more websites semantically mark up the content of their HTML pages.
Microformats
Microdata
RDFa
Schema.org
• Search engines ask site owners to embed data to enrich search results.
• 200+ Classes: Product, Review, LocalBusiness, Person, Place, Event, …
• Encoding: Microdata or RDFa
Usage of Schema.org Data @ Google
• Data snippets within search results
• Data snippets within info boxes
Websites Containing Structured Data
(November 2013)
1.7 million websites (PLDs) out of 12.8 million
provide Microformat, Microdata or RDFa data (13%)
585 million of the 2.2 billion pages contain
Microformat, Microdata or RDFa data (26%).
http://webdatacommons.org/structureddata/
Google, October 2013:
15% of all websites provide structured data.
Top Classes, Microdata (2013)
• schema = Schema.org
• datavoc = Google's Rich Snippet Vocabulary
Example: Microdata, Local Business
Example: Microdata, Product
The Data Integration Pipeline
• Objective: integrate all data found on the web
describing a specific entity (e.g. product or organization)
• Motivation: enables creation of powerful applications, e.g.
comparison shopping portals
• Use case: product data
• Implemented Pipeline:
Web Data Commons Extraction Framework
• Web Data Commons project: extracts structured data from
the Common Crawl
– http://webdatacommons.org/
– http://commoncrawl.org/
• Code available at:
– https://subversion.assembla.com/svn/commondata/
– Based on Anything To Triples (any23) library for extracting
structured data: http://any23.apache.org
• Common Crawl 2012
– 3 billion HTML pages, 40.6 million websites
– 7.3 billion statements describing 1.15 billion things
– 9.4 million product offers from 9240 e-shops
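To make the extraction step concrete, here is a minimal sketch of turning Microdata markup into property-value pairs. It is not the Web Data Commons framework (which builds on Any23); it uses only the Python standard library, handles flat items only (no nested itemscope), and the sample HTML, the class name MicrodataParser and the function extract_items are illustrative assumptions.

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Toy extractor: collects itemprop text for flat Microdata items."""

    def __init__(self):
        super().__init__()
        self.items = []        # one dict per itemscope element
        self._current = None   # item currently being filled
        self._prop = None      # itemprop whose text is being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            # A new item starts; remember its type (e.g. schema.org/Product).
            self._current = {"type": attrs.get("itemtype"), "props": {}}
            self.items.append(self._current)
        elif self._current is not None and "itemprop" in attrs:
            self._prop = attrs["itemprop"]
            self._text = []

    def handle_data(self, data):
        if self._prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes the property element has no child elements.
        if self._prop is not None:
            self._current["props"][self._prop] = "".join(self._text).strip()
            self._prop = None


# Hypothetical page fragment, not taken from the crawl.
SAMPLE = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Apple iPod nano (8 GB, 6th generation, Graphite)</span>
  <span itemprop="description">Memory size: 8GB. Colour: Graphite.</span>
</div>
"""


def extract_items(html):
    parser = MicrodataParser()
    parser.feed(html)
    return parser.items


if __name__ == "__main__":
    for item in extract_items(SAMPLE):
        print(item["type"], item["props"])
```

A production extractor would also need to handle nested items, attribute-valued properties (content, href, src) and malformed HTML, which is why the project relies on an existing library.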
Looking Deeper into E-Commerce Data
Microdata Product (2013)
Looking Deeper into E-Commerce Data
Microdata Product (2012)
Example: Title and Description
Offer 1
• Title: AppleMacBook Air MC968/A 11.6-Inch Laptop
• Description: Faster Flash Storage with 64 GB Solid State Drive and USB 3.0. 720p FaceTime HD Camera. The new 1.6 GHz Intel Core i5 Processor with Intel HD Graphics 3000 enabling beautiful rendering and 4GB DDR3 RAM. 11.6” LED display with the best resolution…
Offer 2
• Title: Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 4 GB, 64 GB, Mac OS X Lion 10.7
• Description: The MacBook Air MC 968/A powered by Intel Core i5(1.6GHz, 3MB L3). 64 GB SSD and 4096 MB of DDR3 RAM. 29.464cm (11.6”) TFT 1366x768, Intel HD Graphics, IEEE 802.11a/b/g, Bluetooth 4.0, FaceTme camera, OS X LIon
Observations
• Various abbreviations are used to describe the same features
• Numeric values are often imprecise due to rounding
• Different descriptions follow different levels of detail
Product Classification
• Starting from 9.4 million products:
  – Keep products with English descriptions longer than 20 words
  => 1,986,359 products from 9,240 e-shops
• Training set
– 18,000 labeled products, 9 classes
• Training the model
– Naïve Bayes Classifier
• Feature generation
– 4-step process: tokenizing and removing stop words, pruning, n-grams, TF-IDF
– ~3600 features
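The four feature-generation steps and the Naïve Bayes classifier can be sketched as follows. This is a small standard-library illustration, not the authors' implementation: the stop-word list, the training texts, the smoothed IDF formula, the choice of bigrams, the class name TfidfNaiveBayes and the order in which pruning is applied (here after n-gram generation) are all assumptions.

```python
import math
import re
from collections import Counter, defaultdict

# Illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "with", "for", "in", "by", "to"}


def tokens(text):
    """Tokenize, lowercase and remove stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def ngrams(toks, n_max=2):
    """Unigrams plus all n-grams up to n_max tokens."""
    return [" ".join(toks[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(toks) - n + 1)]


class TfidfNaiveBayes:
    """Multinomial Naive Bayes over TF-IDF weighted n-gram features."""

    def fit(self, docs, labels, min_df=1):
        grams = [ngrams(tokens(d)) for d in docs]
        df = Counter(g for doc in grams for g in set(doc))
        # Pruning: drop features seen in fewer than min_df documents.
        self.vocab = {g for g, c in df.items() if c >= min_df}
        n = len(docs)
        # Smoothed inverse document frequency (one common variant).
        self.idf = {g: math.log((1 + n) / (1 + df[g])) + 1 for g in self.vocab}
        self.class_counts = Counter(labels)
        self.weights = defaultdict(Counter)
        for doc, label in zip(grams, labels):
            for g, w in self._tfidf(doc).items():
                self.weights[label][g] += w
        return self

    def _tfidf(self, doc):
        tf = Counter(g for g in doc if g in self.vocab)
        return {g: c * self.idf[g] for g, c in tf.items()}

    def predict(self, text):
        feats = self._tfidf(ngrams(tokens(text)))
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Laplace smoothing over the vocabulary.
            denom = sum(self.weights[label].values()) + len(self.vocab)
            score = math.log(count / total)
            for g, w in feats.items():
                score += w * math.log((self.weights[label][g] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


if __name__ == "__main__":
    # Hypothetical training descriptions, far smaller than the 18,000 used.
    docs = ["paperback novel by a famous author",
            "hardcover novel first edition",
            "laptop with intel core processor",
            "tablet with 64 gb storage and display"]
    labels = ["Books", "Books", "Electronics & Computers", "Electronics & Computers"]
    model = TfidfNaiveBayes().fit(docs, labels)
    print(model.predict("signed hardcover novel"))
```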
Classification Performance
Category                     Precision %   Recall %           #
Books                              86.58      87.95     233,249
Movies, Music & Games              89.81      70.63     186,832
Electronics & Computers            92.98      88.00     219,118
Home, Garden & Tools               73.81      60.78     186,495
Grocery, Health & Beauty           70.20      72.86     120,573
Toys, Kids, Baby & Pets            75.00      64.85     114,236
Clothing, Shoes & Jewelry          88.56      89.93     206,315
Sports & Outdoors                  72.83      67.90     143,156
Automotive & Industrial            73.06      65.50     168,567
Average                            80.31      74.26   1,578,541
The offers originate from 9,240 e-shops
Product Feature Extraction
• Low precision (69%) for identity resolution without product feature
extraction
– Used later as a baseline for identity resolution
• We developed the Free Text Preprocessor
– Makes the data more structured by extracting new property-value pairs from free-text properties
– https://www.assembla.com/spaces/silk/wiki/Silk_Free_Text_Preprocessor
Free Text Preprocessor by Example
<http://wdc.org/resource/2> <http://schema.org/Product/title> "Apple iPod nano (8 GB, 6th generation, Graphite)" .
<http://wdc.org/resource/2> <http://schema.org/Product/description>
"Memory size: 8GB. Colour: Graphite Generation: 6th generation. Memory type: Integrated. Weight: 21.1g.
Radio: With Radio. Audio/Video formats: AAC, AIFF, Audible, MP3, WAV, VBR Display: 1.5-inch" .
<http://wdc.org/resource/2> <http://schema.org/Product/Brand> "Apple" .
<http://wdc.org/resource/2> <http://schema.org/Product/Model> "iPod nano" .
<http://wdc.org/resource/2> <http://schema.org/Product/Storage> "8GB" .
<http://wdc.org/resource/2> <http://schema.org/Product/Display> "1.5-inch" .
Free Text Preprocessor
Specification
Extractors – Bag-of-words
• Learning
  – Creating a list of words for every feature in the training set
• Extraction
  – Matching tokens against the learned lists
• Pros
  – Good for extracting nominal and numerical (with units of measurement) attributes
• Cons
  – Bad for extracting multi-token values
  – Inconclusive for values that refer to more than one feature
• Example word lists
  – Brand: Samsung Benq Apple Cannon …
  – Storage: 64 GB megabytes 512GB …
  – Display: 42-inch 3.5-inches Inches 15.24cm …
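A minimal sketch of the bag-of-words extractor, assuming a training set given as (feature, value) pairs. The function names learn_word_lists and extract_bag_of_words, the tokenization rule and the sample data are illustrative, not the Free Text Preprocessor's actual API.

```python
import re
from collections import defaultdict

# Hypothetical structured training data: (feature, value) pairs.
TRAINING = [
    ("Brand", "Samsung"), ("Brand", "Apple"),
    ("Storage", "8GB"), ("Storage", "64 GB"),
    ("Display", "1.5-inch"), ("Display", "42-inch"),
]


def learn_word_lists(training):
    """Learning: build one set of lowercase tokens per feature."""
    lists = defaultdict(set)
    for feature, value in training:
        for tok in value.lower().split():
            lists[feature].add(tok)
    return lists


def extract_bag_of_words(text, lists):
    """Extraction: assign each token to every feature whose list contains it."""
    found = defaultdict(list)
    for tok in re.findall(r"[\w.\-]+", text.lower()):
        tok = tok.strip(".")  # drop sentence-final periods, keep "1.5-inch"
        for feature, words in lists.items():
            if tok in words:
                found[feature].append(tok)
    return dict(found)


if __name__ == "__main__":
    lists = learn_word_lists(TRAINING)
    print(extract_bag_of_words(
        "Apple iPod nano. Memory size: 8GB. Display: 1.5-inch", lists))
```

The sketch also exposes the listed drawbacks: a multi-token value such as "64 GB" is learned as two separate tokens, and a token that appears in two lists is reported for both features.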
Extractors – Feature-Value Pairs
• Learning
  – Learns feature-value pairs from the structured data
• Extraction
  – Tagging: taking n-grams of up to 4 tokens and matching them against the values from the training set
  – Parsing: taking the combination of feature-value pairs that best describes an object from the training dataset
• Pros
  – Extracting multi-token values
• Cons
  – Inconclusive for values that refer to more than one feature
• Example pairs
<Model, Asus EEE 10.1 Inch>
<Processor, 1.66 GHz Intel Atom N445>
<Display, 10.1-inches>
..
<Model, Panasonic Viera>
<Display, 42-Inch>
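The tagging step can be sketched as a longest-match search over n-grams of up to four tokens. This is a simplification under stated assumptions: the parsing step (choosing the best combination of pairs) is replaced by a greedy left-to-right match, and the names learn_pairs and tag are hypothetical.

```python
# Hypothetical training pairs, modelled on the examples above.
TRAINING_PAIRS = [
    ("Model", "Asus EEE 10.1 Inch"),
    ("Display", "10.1-inches"),
    ("Model", "Panasonic Viera"),
    ("Display", "42-Inch"),
]


def learn_pairs(training):
    """Index each normalized value by the set of features it was seen with."""
    index = {}
    for feature, value in training:
        key = " ".join(value.lower().split())
        index.setdefault(key, set()).add(feature)
    return index


def tag(text, index, max_n=4):
    """Greedy tagging: at each position try the longest n-gram first."""
    toks = text.lower().split()
    tags, i = [], 0
    while i < len(toks):
        for n in range(min(max_n, len(toks) - i), 0, -1):
            gram = " ".join(toks[i:i + n])
            if gram in index:
                # A value seen with several features stays ambiguous.
                tags.append((gram, sorted(index[gram])))
                i += n
                break
        else:
            i += 1  # no known value starts here; move on
    return tags


if __name__ == "__main__":
    index = learn_pairs(TRAINING_PAIRS)
    print(tag("Panasonic Viera 42-Inch plasma TV", index))
```

Because whole values are matched, multi-token values such as "Panasonic Viera" survive intact, which is the advantage over the bag-of-words extractor.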
Extractors – Manual Configuration
Manually configure features and extraction methods
1. Regular expressions
   – E.g. Processor: \d*\.?\d+GHz
2. Dictionary search
   – E.g. dictionary of brands (Samsung, Panasonic, Lenovo, Apple)
• Pros
  – Extraction process can be fine-tuned according to the data
  – Good solution when no training (structured) data are available
• Cons
  – Needs domain knowledge
  – Non-trivial to efficiently pick extraction methods manually
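Both manual methods fit in a few lines of Python. The pattern below is an assumed reading of the processor expression on the slide (digits, optional decimal point, digits, then GHz), with optional whitespace and case-insensitivity added as an illustration; the function name extract_manual is hypothetical.

```python
import re

# Assumed reading of the slide's processor pattern, plus optional whitespace.
PROCESSOR_RE = re.compile(r"\d*\.?\d+\s?GHz", re.IGNORECASE)

# Brand dictionary taken from the slide's example.
BRANDS = {"samsung", "panasonic", "lenovo", "apple"}


def extract_manual(text):
    """Apply one regular expression and one dictionary lookup to free text."""
    result = {}
    match = PROCESSOR_RE.search(text)
    if match:
        result["Processor"] = match.group(0)
    for tok in re.findall(r"[A-Za-z]+", text):
        if tok.lower() in BRANDS:
            result["Brand"] = tok  # first dictionary hit wins
            break
    return result


if __name__ == "__main__":
    print(extract_manual("Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 4 GB"))
```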
Extraction Experiments
• Dataset for extraction: 5,000 electronics products from WDC
• Training dataset (structured data)
– Amazon dataset of 20 electronics products
Extraction Accuracy
Product         Brand   Model   Storage   Display   Processor   Dimension
iPod Nano       .92     .98     .86       .49       .12         .78
Galaxy SII      .72     .87     .89       .81       .40         .91
GalaxyTab 7.7   .80     .92     .89       .85       .72         .93
Ixus 120IS      1       .96     N/A       .89       N/A         .56
Vaio VPC        .99     .65     .81       .77       .73         .32
Viera 42        .95     .72     N/A       .82       N/A         .64
Sandisk         1       1       .85       N/A       N/A         .31
• Extraction using Combination configuration
(bag-of-words for Brand, Storage and Display;
feature-value pairs for Model and Dimension;
custom regular expression for the Processor)
Identity Resolution
• We used Silk, a tool for discovering relationships between data items within different linked data sources
  – Provides an expressive language for defining linkage rules
  – Uses genetic programming to learn linkage rules
  – Has shown high performance on various datasets
  – https://www.assembla.com/spaces/silk/wiki/Home
Identity Resolution Experiments
• Gold standard: 5,000 manually annotated links
• 2,500 positive / 2,500 negative
• Amazon dataset of 20 electronics products (reference set)
• Experiment on 5 configurations
– Baseline (no feature extraction step)
– Bag-of-words
– Feature-value pairs
– Manual configuration
– Combinations
Silk Output: Learned Linkage Rule
[Figure: tree diagram of the learned linkage rule. Recoverable elements: input properties wdc:Model, wdc:Display, wdc:Storage and amazon:Model, amazon:Display, amazon:Storage; transforms lowerCase and tokenize; comparisons Levenshtein (threshold = 1.134), Jaccard (threshold = 0.23) and Jaccard (threshold = 0.02); aggregations max and average.]
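The building blocks named in the rule (string transforms, Levenshtein and Jaccard comparisons, max and average aggregation) can be illustrated in plain Python. This is not Silk's rule language or its scoring semantics: the way distances are mapped to scores, the thresholds, the pairing of comparisons with properties and the function names are all assumptions made for the sketch.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def jaccard_distance(a_tokens, b_tokens):
    """1 minus the Jaccard similarity of two token collections."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def score(distance, threshold):
    """Assumed mapping: 1.0 at distance 0, falling to 0.0 at the threshold."""
    return max(0.0, 1.0 - distance / threshold)


def match_score(wdc, amazon):
    """Hypothetical rule: average of the model score and the better of
    the display and storage scores. Thresholds are illustrative only."""
    model = score(levenshtein(wdc["Model"].lower(), amazon["Model"].lower()), 3.0)
    display = score(jaccard_distance(wdc["Display"].lower().split(),
                                     amazon["Display"].lower().split()), 0.5)
    storage = score(jaccard_distance(wdc["Storage"].lower().split(),
                                     amazon["Storage"].lower().split()), 0.5)
    return (model + max(display, storage)) / 2


if __name__ == "__main__":
    offer = {"Model": "iPod nano", "Display": "1.5-inch", "Storage": "8GB"}
    reference = {"Model": "ipod nano", "Display": "1.5-inch", "Storage": "8 GB"}
    print(match_score(offer, reference))
```

Two records would then be linked when the aggregated score exceeds some cut-off; in Silk the rule structure and thresholds are learned rather than hand-picked.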
Identity Resolution Results
Configuration         Precision %   Recall %   F-Measure %
Baseline                       69         90          78.1
Bag-of-words                   75         82          77.9
Feature-value pairs            80         77          78.4
Custom                         82         80          80.9
Combination                    85         80          82.4
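For reference, the F-measure can be recomputed from precision and recall. The sketch assumes the balanced F1 definition (harmonic mean), which the slides do not state explicitly; values recomputed from the rounded percentages can differ from the table in the first decimal.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Baseline and Combination rows from the results table.
    print(round(f_measure(69, 90), 1))
    print(round(f_measure(85, 80), 1))
```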
Data Fusion
• Input: clusters of products after identity resolution
• Properties worth fusing/combining
– AggregateRating and Review
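A minimal sketch of the fusion step, assuming identity resolution has already assigned every offer to a product cluster. The record layout (id, reviews, rating), the cluster mapping and the function name fuse are hypothetical; the sketch simply pools reviews and ratings per cluster and counts offers.

```python
from collections import defaultdict


def fuse(offers, cluster_of):
    """Pool reviews and ratings of all offers that belong to one product.

    offers:     list of dicts with 'id' and optional 'reviews' / 'rating'
    cluster_of: mapping from offer id to product cluster label
    """
    fused = defaultdict(lambda: {"offers": 0, "reviews": [], "ratings": []})
    for offer in offers:
        entry = fused[cluster_of[offer["id"]]]
        entry["offers"] += 1
        entry["reviews"].extend(offer.get("reviews", []))
        if "rating" in offer:
            entry["ratings"].append(offer["rating"])
    return dict(fused)


if __name__ == "__main__":
    # Hypothetical offers from three shops.
    offers = [
        {"id": "a", "reviews": ["great"], "rating": 4.5},
        {"id": "b", "reviews": ["ok", "fine"]},
        {"id": "c"},
    ]
    cluster_of = {"a": "iPod Nano 8GB", "b": "iPod Nano 8GB", "c": "iPhone 4 16GB"}
    for product, data in fuse(offers, cluster_of).items():
        print(product, data["offers"], len(data["reviews"]), len(data["ratings"]))
```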
Fusion Results
Product                      Offers   Reviews   Ratings
iPod Nano 8GB                   829        84         0
iPhone 4 16GB                   624        35        52
Sony Ericsson Xperia Mini       450        31        12
iPad 16GB                       423        40        48
Motorola XOOM 32GB              270        12         0
Samsung Galaxy SII              142         8         0
Conclusions
• By using Microdata, thousands of websites help us to
understand their content
• We have implemented the 5-step data integration pipeline
– From Microdata markup to an integrated dataset
• A newly introduced feature extraction step is crucial for the
precision of data integration
– Identity resolution precision increases from 69% to 85%
• Future work
– Automatically learning regular expressions
– Automatically discovering combinations of extractors
Questions?
38Integrating Product Data from Microdata Markup. Petar Petrovski, Volha Bryl, Chris Bizer

Más contenido relacionado

Similar a Integrating Product Data from Microdata Markup

Jeremy cabral search marketing summit - scraping data-driven content (1)
Jeremy cabral   search marketing summit - scraping data-driven content (1)Jeremy cabral   search marketing summit - scraping data-driven content (1)
Jeremy cabral search marketing summit - scraping data-driven content (1)Jeremy Cabral
 
ALT-F1.BE : The Accelerator (Google Cloud Platform)
ALT-F1.BE : The Accelerator (Google Cloud Platform)ALT-F1.BE : The Accelerator (Google Cloud Platform)
ALT-F1.BE : The Accelerator (Google Cloud Platform)Abdelkrim Boujraf
 
GraphTour - Neo4j Database Overview
GraphTour - Neo4j Database OverviewGraphTour - Neo4j Database Overview
GraphTour - Neo4j Database OverviewNeo4j
 
Introduction to pyspark new
Introduction to pyspark newIntroduction to pyspark new
Introduction to pyspark newAnam Mahmood
 
Building Social Enterprise with Ruby and Salesforce
Building Social Enterprise with Ruby and SalesforceBuilding Social Enterprise with Ruby and Salesforce
Building Social Enterprise with Ruby and SalesforceRaymond Gao
 
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshThe Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshIanFurlong4
 
Sitecore Personalization on websites cached on CDN servers
Sitecore Personalization on websites cached on CDN serversSitecore Personalization on websites cached on CDN servers
Sitecore Personalization on websites cached on CDN serversAnindita Bhattacharya
 
Discover BigQuery ML, build your own CREATE MODEL statement
Discover BigQuery ML, build your own CREATE MODEL statementDiscover BigQuery ML, build your own CREATE MODEL statement
Discover BigQuery ML, build your own CREATE MODEL statementMárton Kodok
 
Solving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute finalSolving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute finalAvere Systems
 
Integrating Product Data from the Semantic Web using Deep Learning Techniques
Integrating Product Data from the Semantic Web using Deep Learning TechniquesIntegrating Product Data from the Semantic Web using Deep Learning Techniques
Integrating Product Data from the Semantic Web using Deep Learning TechniquesChris Bizer
 
Description of IBM Industrie 4.0 StarterPack
Description of IBM Industrie 4.0 StarterPackDescription of IBM Industrie 4.0 StarterPack
Description of IBM Industrie 4.0 StarterPackPlamen Kiradjiev
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataRobert Grossman
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8MongoDB
 
Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...
Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...
Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...DataBench
 
Mihai tataran developing modern web applications
Mihai tataran   developing modern web applicationsMihai tataran   developing modern web applications
Mihai tataran developing modern web applicationsITCamp
 
Apache Spark vs Apache Spark: An On-Prem Comparison of Databricks and Open-So...
Apache Spark vs Apache Spark: An On-Prem Comparison of Databricks and Open-So...Apache Spark vs Apache Spark: An On-Prem Comparison of Databricks and Open-So...
Apache Spark vs Apache Spark: An On-Prem Comparison of Databricks and Open-So...Databricks
 
Building an enterprise Natural Language Search Engine with ElasticSearch and ...
Building an enterprise Natural Language Search Engine with ElasticSearch and ...Building an enterprise Natural Language Search Engine with ElasticSearch and ...
Building an enterprise Natural Language Search Engine with ElasticSearch and ...Debmalya Biswas
 

Similar a Integrating Product Data from Microdata Markup (20)

contentDM
contentDMcontentDM
contentDM
 
Jeremy cabral search marketing summit - scraping data-driven content (1)
Jeremy cabral   search marketing summit - scraping data-driven content (1)Jeremy cabral   search marketing summit - scraping data-driven content (1)
Jeremy cabral search marketing summit - scraping data-driven content (1)
 
ALT-F1.BE : The Accelerator (Google Cloud Platform)
ALT-F1.BE : The Accelerator (Google Cloud Platform)ALT-F1.BE : The Accelerator (Google Cloud Platform)
ALT-F1.BE : The Accelerator (Google Cloud Platform)
 
GraphTour - Neo4j Database Overview
GraphTour - Neo4j Database OverviewGraphTour - Neo4j Database Overview
GraphTour - Neo4j Database Overview
 
Introduction to pyspark new
Introduction to pyspark newIntroduction to pyspark new
Introduction to pyspark new
 
Building Social Enterprise with Ruby and Salesforce
Building Social Enterprise with Ruby and SalesforceBuilding Social Enterprise with Ruby and Salesforce
Building Social Enterprise with Ruby and Salesforce
 
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshThe Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
 
Sitecore Personalization on websites cached on CDN servers
Sitecore Personalization on websites cached on CDN serversSitecore Personalization on websites cached on CDN servers
Sitecore Personalization on websites cached on CDN servers
 
Discover BigQuery ML, build your own CREATE MODEL statement
Discover BigQuery ML, build your own CREATE MODEL statementDiscover BigQuery ML, build your own CREATE MODEL statement
Discover BigQuery ML, build your own CREATE MODEL statement
 
Solving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute finalSolving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute final
 
Integrating Product Data from the Semantic Web using Deep Learning Techniques
Integrating Product Data from the Semantic Web using Deep Learning TechniquesIntegrating Product Data from the Semantic Web using Deep Learning Techniques
Integrating Product Data from the Semantic Web using Deep Learning Techniques
 
Description of IBM Industrie 4.0 StarterPack
Description of IBM Industrie 4.0 StarterPackDescription of IBM Industrie 4.0 StarterPack
Description of IBM Industrie 4.0 StarterPack
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
 
Big data hadoop
Big data hadoopBig data hadoop
Big data hadoop
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8
 
Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...
Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...
Session 2 - A Project Perspective on Big Data Architectural Pipelines and Ben...
 
Mihai tataran developing modern web applications
Mihai tataran   developing modern web applicationsMihai tataran   developing modern web applications
Mihai tataran developing modern web applications
 
Apache Spark vs Apache Spark: An On-Prem Comparison of Databricks and Open-So...
Apache Spark vs Apache Spark: An On-Prem Comparison of Databricks and Open-So...Apache Spark vs Apache Spark: An On-Prem Comparison of Databricks and Open-So...
Apache Spark vs Apache Spark: An On-Prem Comparison of Databricks and Open-So...
 
Building an enterprise Natural Language Search Engine with ElasticSearch and ...
Building an enterprise Natural Language Search Engine with ElasticSearch and ...Building an enterprise Natural Language Search Engine with ElasticSearch and ...
Building an enterprise Natural Language Search Engine with ElasticSearch and ...
 
CODE IGNITER
CODE IGNITERCODE IGNITER
CODE IGNITER
 

Último

Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 

Último (20)

Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx

Integrating Product Data from Microdata Markup

  • 1. Integrating Product Data from Websites offering Microdata Markup School of Business Informatics and Mathematics Petar Petrovski, Volha Bryl, Christian Bizer Data and Web Science Research Group University of Mannheim, Germany
  • 2. Outline 1. HTML-embedded Data on the Web 2. The Data Integration Pipeline 1. Microdata extraction 2. Classification 3. Feature extraction 4. Identity resolution 5. Data Fusion 3. Conclusions 2Integrating Product Data from Microdata Markup. Petar Petrovski, Volha Bryl, Chris Bizer
  • 3. HTML-embedded Data
      More and more websites semantically mark up the content of their HTML pages, using:
      • Microformats
      • Microdata
      • RDFa
  • 4. Schema.org
      • Search engines ask site owners to embed data to enrich search results.
      • 200+ classes: Product, Review, LocalBusiness, Person, Place, Event, …
      • Encoding: Microdata or RDFa
  • 5. Usage of Schema.org Data @ Google
      • Data snippets within search results
      • Data snippets within info boxes
  • 6. Websites Containing Structured Data (November 2013)
      • 1.7 million websites (PLDs) out of 12.8 million provide Microformat, Microdata or RDFa data (13%).
      • 585 million of the 2.2 billion pages contain Microformat, Microdata or RDFa data (26%).
      • http://webdatacommons.org/structureddata/
      • Google, October 2013: 15% of all websites provide structured data.
  • 7. Top Classes, Microdata (2013)
      • schema = Schema.org
      • datavoc = Google's Rich Snippet Vocabulary
  • 8. Example: Microdata, Local Business
  • 9. Example: Microdata, Product
  • 10. The Data Integration Pipeline
      • Objective: integrate all data found on the web describing a specific entity (e.g. product or organization)
      • Motivation: enables creation of powerful applications, e.g. comparison shopping portals
      • Use case: product data
      • Implemented Pipeline:
  • 11. Outline 1. HTML-embedded Data on the Web 2. The Data Integration Pipeline 1. Microdata extraction 2. Classification 3. Feature extraction 4. Identity resolution 5. Data Fusion 3. Conclusions
  • 12. Web Data Commons Extraction Framework
      • Web Data Commons project: extracts structured data from the Common Crawl
          – http://webdatacommons.org/
          – http://commoncrawl.org/
      • Code available at:
          – https://subversion.assembla.com/svn/commondata/
          – Based on the Anything To Triples (any23) library for extracting structured data: http://any23.apache.org
      • Common Crawl 2012
          – 3 billion HTML pages, 40.6 million websites
          – 7.3 billion statements describing 1.15 billion things
          – 9.4 million product offers from 9,240 e-shops
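
To make the extraction step concrete, here is a minimal sketch of pulling Microdata property-value pairs out of an HTML page. It is not the Web Data Commons code (which builds on the Java library any23); it uses only Python's standard `html.parser`, handles flat items without nested `itemscope` elements, and the sample markup is a hypothetical page built around the iPod nano text from slide 21.

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Collect itemprop/value pairs per itemscope (flat items only, no nesting)."""

    def __init__(self):
        super().__init__()
        self.items = []            # one dict per itemscope element
        self._pending_prop = None  # itemprop still waiting for its text content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.items.append({"type": attrs.get("itemtype"), "properties": {}})
        elif "itemprop" in attrs and self.items:
            if "content" in attrs:  # e.g. <meta itemprop="..." content="...">
                self.items[-1]["properties"][attrs["itemprop"]] = attrs["content"]
            else:
                self._pending_prop = attrs["itemprop"]

    def handle_data(self, data):
        # The first non-blank text after an itemprop tag becomes its value.
        if self._pending_prop and data.strip():
            self.items[-1]["properties"][self._pending_prop] = data.strip()
            self._pending_prop = None


# Hypothetical product page fragment (assumption, not taken from the crawl).
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Apple iPod nano (8 GB, 6th generation, Graphite)</span>
  <span itemprop="description">Memory size: 8GB. Colour: Graphite.</span>
  <meta itemprop="brand" content="Apple">
</div>
"""

parser = MicrodataParser()
parser.feed(SAMPLE_HTML)
parser.close()
product = parser.items[0]
print(product["type"])
print(product["properties"])
```

A production extractor also has to deal with nested items, `itemref`, URL-valued properties and malformed markup, which is why the pipeline relies on an existing library rather than a hand-written parser.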
  • 13. Looking Deeper into E-Commerce Data: Microdata Product (2013)
  • 14. Looking Deeper into E-Commerce Data: Microdata Product (2012)
  • 15. Example: Title and Description
      • Offer 1
          – Title: AppleMacBook Air MC968/A 11.6-Inch Laptop
          – Description: Faster Flash Storage with 64 GB Solid State Drive and USB 3.0. 720p FaceTime HD Camera. The new 1.6 GHz Intel Core i5 Processor with Intel HD Graphics 3000 enabling beautiful rendering and 4GB DDR3 RAM. 11.6” LED display with the best resolution…
      • Offer 2
          – Title: Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 4 GB, 64 GB, Mac OS X Lion 10.7
          – Description: The MacBook Air MC 968/A powered by Intel Core i5(1.6GHz, 3MB L3). 64 GB SSD and 4096 MB of DDR3 RAM. 29.464cm (11.6”) TFT 1366x768, Intel HD Graphics, IEEE 802.11a/b/g, Bluetooth 4.0, FaceTme camera, OS X LIon
      • Observations
          – Various abbreviations are used to describe the same features.
          – Numeric values are often imprecise due to rounding.
          – Different descriptions follow different levels of detail.
  • 16. Outline 1. HTML-embedded Data on the Web 2. The Data Integration Pipeline 1. Microdata extraction 2. Classification 3. Feature extraction 4. Identity resolution 5. Data Fusion 3. Conclusions
  • 17. Product Classification
      • Starting from 9.4 million products:
          – Products with English descriptions longer than 20 words => 1,986,359 products from 9,240 e-shops
      • Training set
          – 18,000 labeled products, 9 classes
      • Training the model
          – Naïve Bayes classifier
      • Feature generation
          – 4-step process: tokenizing and removing stop words, pruning, n-grams, TF-IDF
          – ~3,600 features
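
The classification step can be illustrated with a small multinomial Naïve Bayes sketch over unigram and bigram features. This is an illustration only, not the authors' implementation: the stop-word list and the four training descriptions are made up, and the pruning and TF-IDF weighting steps named on the slide are left out to keep the example short.

```python
import math
import re
from collections import Counter, defaultdict

# Toy stop-word list (assumption; the slides do not say which list was used).
STOP_WORDS = {"a", "an", "and", "the", "of", "with", "for", "in", "by"}


def features(text, max_n=2):
    """Tokenize, drop stop words, and return all n-grams up to max_n."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]
    grams = []
    for n in range(1, max_n + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = features(text)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for gram in features(text):
                if gram in self.vocab:  # ignore features never seen in training
                    score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data; the real training set had 18,000 labeled products.
train_texts = [
    "Paperback novel by a bestselling author",
    "Hardcover edition of the classic novel",
    "Laptop with Intel Core i5 processor and 4 GB RAM",
    "Smartphone with 16 GB storage and HD camera",
]
train_labels = ["Books", "Books", "Electronics & Computers", "Electronics & Computers"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("Ultrabook laptop, Intel processor, 64 GB SSD"))
print(model.predict("Classic paperback novel"))
```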
  • 18. Classification Performance
      Category                     Precision %   Recall %   # Products
      Books                              86.58      87.95      233,249
      Movies, Music & Games              89.81      70.63      186,832
      Electronics & Computers            92.98      88.00      219,118
      Home, Garden & Tools               73.81      60.78      186,495
      Grocery, Health & Beauty           70.20      72.86      120,573
      Toys, Kids, Baby & Pets            75.00      64.85      114,236
      Clothing, Shoes & Jewelry          88.56      89.93      206,315
      Sports & Outdoors                  72.83      67.90      143,156
      Automotive & Industrial            73.06      65.50      168,567
      Average / Total                    80.31      74.26    1,578,541
      The offers originate from 9,240 e-shops.
  • 19. Outline 1. HTML-embedded Data on the Web 2. The Data Integration Pipeline 1. Microdata extraction 2. Classification 3. Feature extraction 4. Identity resolution 5. Data Fusion 3. Conclusions
  • 20. Product Feature Extraction
      • Low precision (69%) for identity resolution without product feature extraction
          – Used later as a baseline for identity resolution
      • We developed the Free Text Preprocessor
          – Makes the data more structured by extracting new property-value pairs from free-text properties
          – https://www.assembla.com/spaces/silk/wiki/Silk_Free_Text_Preprocessor
  • 21. Free Text Preprocessor by Example
      Input triples:
      <http://wdc.org/resource/2> <http://schema.org/Product/title> "Apple iPod nano (8 GB, 6th generation, Graphite)" .
      <http://wdc.org/resource/2> <http://schema.org/Product/description> "Memory size: 8GB. Colour: Graphite Generation: 6th generation. Memory type: Integrated. Weight: 21.1g. Radio: With Radio. Audio/Video formats: AAC, AIFF, Audible, MP3, WAV, VBR Display: 1.5-inch" .
  • 22. Free Text Preprocessor by Example
      The input triples from slide 21, followed by the extracted property-value pairs:
      <http://wdc.org/resource/2> <http://schema.org/Product/Brand> "Apple" .
      <http://wdc.org/resource/2> <http://schema.org/Product/Model> "iPod nano" .
      <http://wdc.org/resource/2> <http://schema.org/Product/Storage> "8GB" .
      <http://wdc.org/resource/2> <http://schema.org/Product/Display> "1.5-inch" .
  • 23. Silk Free Text Preprocessor by Example
      The input and extracted triples from slide 22, shown together with the Free Text Preprocessor Specification.
  • 24. Extractors – Bag-of-words
      • Learning
          – Creating a list of words for every feature in the training set
      • Extraction
          – Matching tokens against the learned lists
      • Pros
          – Good for extracting nominal and numerical (with units of measurement) attributes
      • Cons
          – Bad for extracting multi-token values
          – Inconclusive for values that refer to more than one feature
      • Example word lists
          – Brand: Samsung Benq Apple Canon …
          – Storage: 64 GB megabytes 512GB …
          – Display: 42-inch 3.5-inches Inches 15.24cm …
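
A minimal sketch of the bag-of-words extractor described on slide 24, assuming that the training data is a list of feature-to-value dictionaries and that a token claimed by more than one feature is simply skipped as inconclusive. The function names and the training examples are hypothetical, not the Free Text Preprocessor API.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase, keep runs of letters/digits/dots/hyphens, trim stray punctuation."""
    raw = re.findall(r"[a-z0-9.\-]+", text.lower())
    return [t.strip(".-") for t in raw if t.strip(".-")]


def learn_word_lists(training_objects):
    """Learning: build one set of words per feature from structured examples."""
    word_lists = defaultdict(set)
    for obj in training_objects:
        for feature, value in obj.items():
            word_lists[feature].update(tokenize(value))
    return word_lists


def extract(text, word_lists):
    """Extraction: match each token of the free text against the learned lists."""
    extracted = defaultdict(list)
    for token in tokenize(text):
        matches = [f for f, words in word_lists.items() if token in words]
        if len(matches) == 1:  # tokens claimed by several features are inconclusive
            extracted[matches[0]].append(token)
    return dict(extracted)


# Hypothetical structured training objects.
TRAINING = [
    {"Brand": "Apple", "Storage": "8GB", "Display": "1.5-inch"},
    {"Brand": "Samsung", "Storage": "16GB", "Display": "4.3-inch"},
]

word_lists = learn_word_lists(TRAINING)
result = extract("Apple iPod nano, 8GB, Graphite. Display: 1.5-inch", word_lists)
print(result)
```

Because matching is done token by token, a value such as "iPod nano" cannot be recovered as a unit, which is the multi-token weakness listed under Cons.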
  • 25. Extractors – Feature-Value Pairs
      Learns feature-value pairs from the structured data
      • Extraction
          – Tagging: taking n-grams up to 4 and matching against the values from the training set
          – Parsing: taking the combination of feature-value pairs that best describes an object from the training dataset
      • Pros
          – Extracting multi-token values
      • Cons
          – Inconclusive for values that refer to more than one feature
      • Example pairs
          – <Model, Asus EEE 10.1 Inch> <Processor, 1.66 GHz Intel Atom N445> <Display, 10.1-inches> ..
          – <Model, Panasonic Viera> <Display, 42-Inch>
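
The tagging and parsing steps can be sketched as follows. This is one possible reading of the slide, not the authors' algorithm: tagging matches word n-grams (n ≤ 4) of the free text against known values, and parsing keeps the tagged pairs belonging to the training object with the largest overlap. The training objects reuse the example pairs from the slide; the input sentence and function names are hypothetical.

```python
def ngrams(tokens, max_n=4):
    """Yield all word n-grams of the token list for n = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def tag(text, training_objects):
    """Tagging: find (feature, value) pairs whose value occurs as an n-gram."""
    grams = set(ngrams(text.lower().split()))
    tagged = set()
    for obj in training_objects:
        for feature, value in obj.items():
            if value.lower() in grams:
                tagged.add((feature, value))
    return tagged


def parse(tagged, training_objects):
    """Parsing: keep the pairs of the training object that overlaps most."""
    def overlap(obj):
        return sum((f, v) in tagged for f, v in obj.items())

    best = max(training_objects, key=overlap)
    return {f: v for f, v in best.items() if (f, v) in tagged}


# Feature-value pairs as shown on the slide.
TRAINING = [
    {"Model": "Asus EEE 10.1 Inch",
     "Processor": "1.66 GHz Intel Atom N445",
     "Display": "10.1-inches"},
    {"Model": "Panasonic Viera", "Display": "42-Inch"},
]

tagged = tag("New Panasonic Viera 42-Inch plasma TV", TRAINING)
pairs = parse(tagged, TRAINING)
print(pairs)
```

Note that the five-token Processor value on the slide could never be tagged with n-grams limited to four tokens; whitespace tokenization is also a simplification that ignores punctuation.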
  • 26. Extractors – Manual Configuration
      Manually configure features and extraction methods:
      1. Regular expressions
          – E.g. Processor: \d*\.?\d+GHz
      2. Dictionary search
          – E.g. dictionary of brands (Samsung, Panasonic, Lenovo, Apple)
      • Pros
          – Extraction process can be fine-tuned according to the data
          – Good solution when no training (structured) data are available
      • Cons
          – Needs domain knowledge
          – Non-trivial to efficiently pick extraction methods manually
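
Both manual extraction methods are easy to sketch in a few lines. The regular expression below is the slide's Processor pattern with an optional space before "GHz" added as an assumption, and the brand dictionary is the one given on the slide; the function name is hypothetical. The sample text is the second offer title from slide 15.

```python
import re

# Processor clock speed, e.g. "1.60GHz" or "1.6 GHz" (the \s? is an added assumption).
PROCESSOR_RE = re.compile(r"\d*\.?\d+\s?GHz", re.IGNORECASE)

# Dictionary of brands from the slide, lowercased for case-insensitive lookup.
BRANDS = {"samsung", "panasonic", "lenovo", "apple"}


def extract_manual(text):
    """Apply one regex extractor and one dictionary extractor to free text."""
    result = {}
    match = PROCESSOR_RE.search(text)
    if match:
        result["Processor"] = match.group(0)
    for token in re.findall(r"[A-Za-z]+", text):
        if token.lower() in BRANDS:
            result["Brand"] = token
            break  # keep the first brand mentioned
    return result


title = "Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 4 GB, 64 GB, Mac OS X Lion 10.7"
extracted = extract_manual(title)
print(extracted)
```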
  • 27. Extraction Experiments
      • Dataset for extraction: 5,000 electronics products from WDC
      • Training dataset (structured data): Amazon dataset of 20 electronics products
  • 28. Extraction Accuracy
      Product         Brand   Model   Storage   Display   Processor   Dimension
      iPod Nano       .92     .98     .86       .49       .12         .78
      Galaxy SII      .72     .87     .89       .81       .40         .91
      GalaxyTab 7.7   .80     .92     .89       .85       .72         .93
      Ixus 120IS      1       .96     N/A       .89       N/A         .56
      Vaio VPC        .99     .65     .81       .77       .73         .32
      Viera 42        .95     .72     N/A       .82       N/A         .64
      Sandisk         1       1       .85       N/A       N/A         .31
      • Extraction using the Combination configuration (bag-of-words for Brand, Storage and Display; feature-value pairs for Model and Dimension; custom regular expression for the Processor)
  • 29. Outline 1. HTML-embedded Data on the Web 2. The Data Integration Pipeline 1. Microdata extraction 2. Classification 3. Feature extraction 4. Identity resolution 5. Data Fusion 3. Conclusions
  • 30. Identity Resolution
      • We used Silk, a tool for discovering relationships between data items within different linked data sources
          – Provides an expressive language for defining linkage rules
          – Uses genetic programming to learn linkage rules
          – Has shown high performance on various datasets
          – https://www.assembla.com/spaces/silk/wiki/Home
  • 31. Identity Resolution Experiments
      • Gold standard: 5,000 manually annotated links
          – 2,500 positive / 2,500 negative
          – Amazon dataset of 20 electronics products (reference set)
      • Experiment on 5 configurations
          – Baseline (no feature extraction step)
          – Bag-of-words
          – Feature-value pairs
          – Manual configuration
          – Combinations
  • 32. Silk Output: Learned Linkage Rule
      [Rule-tree diagram; nodes recoverable from the slide:]
      • Properties: wdc:Model, amazon:Model, wdc:Display, amazon:Display, wdc:Storage, amazon:Storage
      • Transforms: lowerCase (twice), tokenize (twice)
      • Comparisons: Levenshtein (threshold = 1.134), Jaccard (threshold = 0.23), Jaccard (threshold = 0.02)
      • Aggregations: max, average
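
The learned rule on slide 32 is built from two standard comparison measures, Levenshtein edit distance and Jaccard similarity over token sets. The sketch below implements just these two measures in plain Python; it does not reproduce Silk's rule language, its threshold semantics, or the aggregation tree, and the example strings are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def jaccard(a_tokens, b_tokens):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Model values compared after a lowerCase transform.
model_distance = levenshtein("iPod nano".lower(), "IPOD Nano 6G".lower())

# Display values compared after a tokenize transform.
display_similarity = jaccard("1.5-inch display".split(), "display 1.5-inch tft".split())

print(model_distance, display_similarity)
```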
  • 33. Identity Resolution Results
      Configuration         Precision %   Recall %   F-Measure %
      Baseline              69            90         78.1
      Bag-of-words          75            82         77.9
      Feature-value pairs   80            77         78.4
      Custom                82            80         80.9
      Combination           85            80         82.4
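
For reference, the F-measure is the harmonic mean of precision and recall, F = 2PR / (P + R). The short sketch below recomputes it from the rounded precision and recall values on slide 33. The slide's figures were presumably computed from unrounded values (an assumption), so recomputed numbers can differ slightly from the reported ones for some rows.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision/recall percentages as reported on the slide.
results = {
    "Baseline": (69, 90),
    "Bag-of-words": (75, 82),
    "Feature-value pairs": (80, 77),
    "Custom": (82, 80),
    "Combination": (85, 80),
}

for name, (p, r) in results.items():
    print(f"{name}: F-measure = {f_measure(p, r):.1f}")
```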
  • 34. Outline 1. HTML-embedded Data on the Web 2. The Data Integration Pipeline 1. Microdata extraction 2. Classification 3. Feature extraction 4. Identity resolution 5. Data Fusion 3. Conclusions
  • 35. Data Fusion
      • Input: clusters of products after identity resolution
      • Properties worth fusing/combining
          – AggregateRating and Review
  • 36. Fusion Results
      Product                     Offers   Reviews   Ratings
      iPod Nano 8GB               829      84        0
      iPhone 4 16GB               624      35        52
      Sony Ericsson Xperia Mini   450      31        12
      iPad 16GB                   423      40        48
      Motorola XOOM 32GB          270      12        0
      Samsung Galaxy SII          142      8         0
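
As a rough illustration of the fusion step, the sketch below takes one cluster of offers that identity resolution has grouped as the same product and collects their reviews and ratings into a single record. The record layout, field names and sample data are assumptions made for this example; the slides do not describe the fusion implementation.

```python
def fuse_cluster(product_name, offers):
    """Combine the reviews and ratings of all offers in one product cluster."""
    fused = {"product": product_name, "offers": len(offers), "reviews": [], "ratings": []}
    for offer in offers:
        fused["reviews"].extend(offer.get("reviews", []))
        fused["ratings"].extend(offer.get("ratings", []))
    return fused


# Hypothetical cluster: three offers from different shops for the same product.
cluster = [
    {"shop": "shop-a.example", "reviews": ["Great sound."], "ratings": [4.5]},
    {"shop": "shop-b.example", "reviews": ["Battery could be better.", "Tiny and light."]},
    {"shop": "shop-c.example"},
]

fused = fuse_cluster("iPod Nano 8GB", cluster)
print(fused["offers"], len(fused["reviews"]), len(fused["ratings"]))
```

The counts printed here correspond to the Offers, Reviews and Ratings columns of the fusion results table.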
  • 37. Conclusions
      • By using Microdata, thousands of websites help us to understand their content
      • We have implemented the 5-step data integration pipeline
          – From Microdata markup to an integrated dataset
      • A newly introduced feature extraction step is crucial for the precision of data integration
          – Identity resolution precision increases from 69% to 85%
      • Future work
          – Automatically learning regular expressions
          – Automatically discovering combinations of extractors
  • 38. Questions?