THIS IS A MIXED DECK WITH SLIDES FROM PROF. DR. CHRIS BIZER, OLIVER GRISEL, SOREN AUER AND JESSE WANG
THE WEB OF DATA
AGENDA
Introduction to the Web of (Open Semantic) Data
Linked Open Data and 5-star Data Principles
DBpedia – Query Wikipedia as a database
Linked Data Integration Framework
Common Crawl Database
Web Data Commons
Summary
11/7/11
“To a computer, then, the web is
a flat, boring world devoid of meaning”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
“This is a pity, as in fact documents on the web
describe real objects and
imaginary concepts, and give
particular relationships between them”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
“Adding semantics to the web involves two things:
allowing documents which have information
in machine-readable forms, and allowing links to
be created with relationship values.”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
THE WEB OF DATA - HOW?
RDF / Triple Stores / SPARQL
Graph stores with dynamic schemas
Strong interoperability
JSON-LD
Upgrade your JSON with scoped vocabularies
Web / Mobile / JS developer friendly
RDFa + schema.org & rNews
Publish annotation in structured markup
Vocabulary understood by Search Engines
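A minimal sketch of the JSON-LD idea, assuming a hypothetical product record: plain JSON keys are mapped to schema.org terms through an `@context`. The record, its values and the toy `expand_keys` helper are illustrative only; a real application would rely on a full JSON-LD processor.

```python
import json

# Plain JSON as a web or mobile developer might already produce it (hypothetical record).
record = {"name": "Example Phone", "brand": "ExampleCo"}

# The @context scopes each key to a vocabulary term, here schema.org.
context = {
    "name": "http://schema.org/name",
    "brand": "http://schema.org/brand",
}

def to_json_ld(data, ctx, type_iri):
    """Attach a context and a type so the keys become globally meaningful."""
    doc = {"@context": ctx, "@type": type_iri}
    doc.update(data)
    return doc

def expand_keys(doc):
    """Toy expansion: replace each short key by the IRI the context assigns to it."""
    ctx = doc["@context"]
    return {ctx[k]: v for k, v in doc.items() if k in ctx}

doc = to_json_ld(record, context, "http://schema.org/Product")
print(json.dumps(doc, indent=2))
print(expand_keys(doc))
```

The existing JSON payload stays unchanged; only the context and type are added, which is what makes this route friendly to JS and mobile developers.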
THE WEB OF DATA - WHAT?
Linked Open Data
Started with DBpedia – Wikipedia as database
As of September 2011, the LOD cloud contains nearly 300 datasets
Web Data Commons
Based on Common Crawl Database
LOD + OpenGraph + Schema.org
Knowledge-Bases?
Can we be a valuable contributor?
LINKED DATA PARADIGM
Use URIs as names for things
Use HTTP URIs so that people can
look up those names.
When someone looks up a
URI, provide useful information.
Include links to other URIs, so that
they can discover more things.
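The second and third principles can be sketched as an HTTP lookup with content negotiation. The example below only prepares the request and does not send it; the DBpedia resource URI and the `text/turtle` media type are chosen for illustration, and `build_lookup` is a hypothetical helper name.

```python
import urllib.request

def build_lookup(uri, media_type="text/turtle"):
    """Prepare (but do not send) an HTTP GET asking for a data representation of a URI."""
    return urllib.request.Request(uri, headers={"Accept": media_type})

req = build_lookup("http://dbpedia.org/resource/Berlin")
print(req.get_method(), req.full_url, req.get_header("Accept"))

# To actually dereference the name and read the returned description:
# body = urllib.request.urlopen(req).read()
```

The returned description would itself contain further HTTP URIs, which is how a client following the fourth principle discovers more things.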
5 ★ OPEN DATA
Tim Berners-Lee, inventor of the Web and Linked Data
initiator, suggested a 5 star deployment scheme for Open
Data.
Here, we give examples for each step of
the stars and explain costs
and benefits that come
along with it.
http://5stardata.info/
AND IT STARTS WITH…
DBPEDIA
Joint project to
• create a huge, multi-lingual
knowledge base
• by extracting structured
information from Wikipedia
• make the knowledge base
available on the Web
as Linked Data under an open
license
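A hedged sketch of what "query Wikipedia as a database" can look like: a SPARQL query encoded as an HTTP GET URL. The endpoint address and the `dbo:City` class are assumptions about the public DBpedia service, the query is illustrative, and nothing is sent over the network here.

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint (assumed; not contacted in this sketch).
ENDPOINT = "http://dbpedia.org/sparql"

QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?city ?label WHERE {
  ?city a dbo:City ;
        rdfs:label ?label .
  FILTER (lang(?label) = "en")
} LIMIT 5
"""

def build_query_url(endpoint, query):
    """Encode a SPARQL query in the 'query' parameter of a GET request URL."""
    return endpoint + "?" + urlencode({"query": query})

url = build_query_url(ENDPOINT, QUERY)
print(url[:80] + "...")
```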
WE HELPED DBPEDIA (3.5, 2010.4)
• Extraction framework
completely rewritten
• Mapping language
redesigned
• Hosted on a wiki
http://mappings.dbpedia.org
• A lot more things
extracted
• …
[Chart: total triples, DBpedia 3.4 vs. DBpedia 3.5 (axis 0 to 1200)]
[Figure: timeline 2007, 2008, 2009, 2010, 2011.09]
DBPEDIA 3.8 (NOW)
• Structured Information in Wikipedia
• infoboxes
• geo-coordinates
• categorization of articles
• inter-language links
• links to images and external webpages
• titles and abstracts
• tables and lists
• Currently 111 localized editions
Category            Instances    Statements    Distinct Properties
Person                871,630    18,323,794              6,195,234
Artist                100,793     3,723,440                998,616
Actor                  25,340     1,070,066                247,690
Musical Artist         46,364     2,069,152                550,225
Athlete               217,067     6,373,136              1,853,233
Politician             41,126     1,407,548                454,209
Place                 643,260    24,698,893              8,026,305
Building               65,355     1,058,610                530,010
Airport                11,675       352,377                138,944
Bridge                  3,425        66,968                 34,470
Skyscraper                 68         3,091                    719
Populated Place       424,291    20,565,679              6,212,991
River                  26,892       681,782                208,146
Organisation          206,670     4,940,190              2,029,620
Band                   29,101     1,126,744                298,743
Company                48,989     1,048,251                445,758
Educ.Institution       43,250       958,257                493,792
Work                  360,808     9,649,228              3,566,511
Book                   44,339     1,111,960                408,724
Film                   75,067     2,663,487                787,129
Musical Work          160,383     4,116,625              1,635,655
Album                 122,729     3,400,942              1,224,746
Single                 42,393     1,226,636                534,023
Software               28,930       731,138                242,411
Television Show        24,784       565,136                282,594
[Chart: instances, statements and distinct properties per category (axis 0 to 30,000,000)]
CROSS-LANGUAGE OVERLAP
CONSUMING LINKED DATA
Browsers
• LOD Cloud
http://datahub.io
• Tabulator
• Disco
• Linked Open Data
Explorer
• Marbles
• ObjectViewer
Search Engines
• Sameas.org
• Sindice
• Sig.ma
• LOD Cache (Virtuoso
by OpenLinkSoftware)
• SWSE - DERI
• VisiNav
• Falcon
• Swoogle
LDIF – LINKED DATA INTEGRATION FRAMEWORK
• Single Machine / Hadoop Version
• tested with 3.6 billion RDF quads
A SILK LINKAGE RULE
LEARNING LINKAGE RULES USING GENETIC PROGRAMMING
• based on existing reference links
• GenLink learns
  • comparisons
  • aggregations
  • transformations
  • weights
• instead of subtree crossover, we use a set of custom crossover operators
Aggregation Crossover
Transformation Crossover
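To make the four building blocks concrete, here is a small hand-written linkage rule combining a transformation, two comparisons, an aggregation and weights. It is a sketch under stated assumptions, not Silk's rule language and not GenLink output: `difflib` stands in for a string metric, and the field names, weights and threshold are hypothetical.

```python
from difflib import SequenceMatcher

def lower(value):
    """Transformation: normalise case before comparing."""
    return value.lower()

def string_similarity(a, b):
    """Comparison: similarity in [0, 1] (difflib ratio as a stand-in metric)."""
    return SequenceMatcher(None, a, b).ratio()

def weighted_average(scores_and_weights):
    """Aggregation: combine comparison scores according to their weights."""
    total = sum(w for _, w in scores_and_weights)
    return sum(s * w for s, w in scores_and_weights) / total

def linkage_rule(x, y, threshold=0.8):
    """Return True if records x and y are judged to describe the same entity."""
    title = string_similarity(lower(x["title"]), lower(y["title"]))
    date = 1.0 if x["date"] == y["date"] else 0.0
    return weighted_average([(title, 3), (date, 1)]) >= threshold

a = {"title": "Learning Linkage Rules", "date": "2012"}
b = {"title": "learning linkage rules", "date": "2012"}
c = {"title": "A Completely Different Paper", "date": "1999"}
print(linkage_rule(a, b), linkage_rule(a, c))
```

A learner in the style described on the slide would search over exactly these choices (which comparisons, which transformations, how to aggregate, which weights) instead of having them fixed by hand.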
RESULTS FOR THE CORA EVALUATION DATA SET
• Citations to research papers from the Cora research paper search engine
• Attributes: Title, Author, Venue, Date of publication
• Reference Links: 1600
• GenLink achieved an F-measure of 96.6% against the validation set.
• Carvalho et al. report an F-measure of 91.0% against the validation set (last line).
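For reference, the F-measure quoted above is the harmonic mean of precision and recall over the generated links. A short sketch, using hypothetical counts that are not the Cora results:

```python
def f_measure(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall for a set of generated links."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for illustration only.
print(f_measure(true_pos=90, false_pos=10, false_neg=10))
print(f_measure(true_pos=8, false_pos=2, false_neg=8))
```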
LEARNED RULE
Robert Isele and Christian Bizer: Learning Expressive Linkage
Rules using Genetic Programming. PVLDB 5(11):1638-1649, 2012
ACTIVE LEARNING OF LINKAGE RULES
• Query Strategy: Select the link candidate for which the
linkage rules in the current population disagree the most.
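The query strategy can be sketched as follows. The disagreement measure (0 when all rules agree, 1 when they split evenly) is one simple choice made for this example and is not necessarily the measure used by the authors; the rule population and candidates are hypothetical.

```python
def disagreement(votes):
    """0.0 when all rules agree, 1.0 when they split evenly."""
    positive = sum(votes) / len(votes)
    return 1.0 - abs(2.0 * positive - 1.0)

def select_query(candidates, rules):
    """Pick the link candidate on which the rule population disagrees the most."""
    return max(candidates, key=lambda c: disagreement([rule(c) for rule in rules]))

# Hypothetical population: each rule thresholds a precomputed similarity score.
rules = [lambda c, t=t: c["sim"] >= t for t in (0.3, 0.5, 0.7, 0.9)]
candidates = [
    {"id": "p1", "sim": 0.95},  # every rule says "link"
    {"id": "p2", "sim": 0.60},  # rules split two against two
    {"id": "p3", "sim": 0.10},  # every rule says "no link"
]
print(select_query(candidates, rules)["id"])
```

The selected candidate is the one a human is asked to label next, since its label removes the most uncertainty from the current population.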
STRUCTURED DATA ON THE WEB
WE HAVE THE TOOLS NOW
HTML-EMBEDDED STRUCTURED DATA ON THE WEB
More and more websites semantically mark up the content of their HTML pages.
Microformats
Microdata
RDFa
MICROFORMATS
• Microformat effort dates back to 2003
• Small set of fixed formats
• hCard : people, companies, organizations, and places
• XFN : relationships between people
• hCalendar : calendaring and events
• hListing : small-ads; classifieds
• hReview : reviews of products, businesses, events
• Shortcoming of Microformats
• cannot represent arbitrary kinds of data, only the fixed formats.
• indexed by Google and Yahoo since 2009
RDFA
• serialization format for embedding RDF data
into HTML pages
• proposed in 2004, W3C Recommendation in 2008
• can be used together with any vocabulary
• can assign URIs as global primary keys to entities
OPEN GRAPH PROTOCOL
• allows site owners to determine how
entities are described in Facebook
• relies on RDFa for encoding data in HTML pages
• available since April 2010
MICRODATA
• alternative technique for embedding structured data
• proposed in 2009 by WHATWG as part of HTML5 work
• tries to be simpler than RDFa (5 new attributes instead of 8)
• W3C is currently trying to reconcile the two alternative proposals
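As an illustration of how little is needed to read simple Microdata, the sketch below collects `itemprop` values from a flat item using only the standard library. The HTML snippet is hypothetical, and the parser deliberately ignores nested items, `itemref` and attribute-valued properties, so it is not a conforming Microdata extractor.

```python
from html.parser import HTMLParser

class ItempropCollector(HTMLParser):
    """Collect (itemprop, text) pairs from flat Microdata markup."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = []
        self._current = None  # itemprop whose text is being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.item_type = attrs.get("itemtype")
        if "itemprop" in attrs:
            self._current = attrs["itemprop"]

    def handle_data(self, data):
        # The first non-blank text after an itemprop start tag is taken as its value.
        if self._current and data.strip():
            self.properties.append((self._current, data.strip()))
            self._current = None

HTML = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <span itemprop="jobTitle">Editor</span>
</div>
"""

parser = ItempropCollector()
parser.feed(HTML)
print(parser.item_type, parser.properties)
```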
SCHEMA.ORG
• asks site owners to embed data to enrich search results.
• 200+ Types:
Event, Organization, Person, Place, Product, Review
• Encoding: Microdata or alternatively RDFa
USAGE OF SCHEMA.ORG DATA @ GOOGLE
Answers to fact queries
Data snippets within search results
Data tables within search results
THE COMMON CRAWL CORPORA
• Provides two web corpora on Amazon S3
• 2009/2010 Corpus: 2.5 billion HTML pages
• June 2012 Corpus: 3.0 billion HTML pages
• The June 2012 Corpus
• unique HTML pages: 3,005,629,093
• pay-level-domains (PLDs): 40.6 million
• size of the corpus in compressed form: 48 terabytes
• Crawler uses PageRank to decide which pages to retrieve
→ snapshot of the popular part of the Web
→ number of pages per site varies widely
• youtube.com: 93.1 million pages
• 37.5 million PLDs with fewer than 100 pages
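A quick calculation from the figures above shows how skewed the corpus is: the average is roughly 74 pages per pay-level domain, yet about 92% of domains have fewer than 100 pages and youtube.com alone accounts for about 3% of all pages. The variable names are chosen for this sketch only.

```python
# Figures reported for the June 2012 corpus.
pages = 3_005_629_093
plds = 40_600_000
small_plds = 37_500_000      # PLDs with fewer than 100 pages
youtube_pages = 93_100_000

avg_pages_per_pld = pages / plds
share_small_plds = small_plds / plds
share_youtube = youtube_pages / pages

print(f"{avg_pages_per_pld:.1f} pages per PLD on average")
print(f"{share_small_plds:.1%} of PLDs have fewer than 100 pages")
print(f"{share_youtube:.1%} of all pages come from youtube.com")
```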
LOOKUP INDEX
WEB DATA COMMONS
• WebDataCommons.org Project
• extracts all Microformat, Microdata, RDFa data from the Common Crawl
• provides the extracted data for free download
• Two extraction runs
• 2009/2010 CC Corpus: 2.5 billion HTML pages → 5.1 billion RDF triples
• 2012 CC Corpus: 3.0 billion HTML pages → 7.3 billion RDF triples
• Joint project of
THE WDC EXTRACTION FRAMEWORK
• 700,000 input files queued in SQS
• EC2 workers take tasks from SQS
• Workers read and write S3 buckets
[Diagram: input files queued in SQS; EC2 workers read Common Crawl (CC) files from S3 and write WDC results back to S3]
Workers
• 100 spot instances of type c1.xlarge (7 GB RAM, 8 cores)
• 5,600 machine hours
• 398 US$
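The queue-and-worker pattern can be sketched locally, with an in-memory queue standing in for SQS, threads standing in for EC2 workers and a dictionary standing in for the S3 output bucket. The `extract` function is a hypothetical placeholder, not the WDC extractor. The last lines derive the cost per machine hour from the reported figures.

```python
import queue
import threading

def extract(file_name):
    """Placeholder for the real extraction step (hypothetical)."""
    return f"triples-from-{file_name}"

def worker(tasks, results):
    """Take file names from the queue until it is empty, storing one result per file."""
    while True:
        try:
            name = tasks.get_nowait()
        except queue.Empty:
            return
        results[name] = extract(name)  # stands in for writing to an S3 bucket

tasks = queue.Queue()
for i in range(20):
    tasks.put(f"input-{i}")

results = {}
threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Cost arithmetic from the reported figures: 398 US$ for 5600 machine hours on 100 instances.
cost_per_machine_hour = 398 / 5600
hours_per_instance = 5600 / 100
print(len(results), round(cost_per_machine_hour, 3), hours_per_instance)
```

Because each task is an independent input file, adding workers shortens the run without changing the total machine hours, which is what makes cheap spot instances a good fit.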
WEBSITES CONTAINING STRUCTURED DATA (CC 2012)
2.29 million websites (PLDs) out of 40.6 million
provide Microformat, Microdata or RDFa data
(5.65%)
369 million of the 3 billion pages contain
Microformat, Microdata or RDFa data (12.3%).
• Grouped by Alexa Website Popularity Rank
(site rank based on amount of page views)
POPULARITY OF WEBSITES CONTAINING STRUCTURED DATA
BREAKDOWN BY ENCODING FORMAT (CC 2012)
DISTRIBUTION BY TOP LEVEL DOMAIN
• Top Classes:
• Topics
• CMS and Blog metadata
• Product data
• Ratings
• Navigational metadata
• Company listings
RDFA TOPICS (CC 2012)
• Top Classes:
• Topics
• CMS and Blog metadata
• Navigational metadata
• Products and offers
• Business listings
• Ratings
• Places
• Events
MICRODATA TOPICS (CC 2012)
datavoc = Google's Rich Snippet Vocabulary
CLASS / PROPERTY DISTRIBUTION
Only a small set of classes / properties is used.
Heterogeneity on the schema level is easy to overcome.
MICROFORMATS
• Top Classes:
• Topics
• Persons
• Organisations
• Events
• Listings and Reviews
• Recipes
LOOKING DEEPER INTO THE E-COMMERCE DATA
• Microdata, 2012
SHOPS BY PRODUCT CATEGORY
• Classifier trained for 9 product categories on descriptions from Amazon.
• Examined 9000 English-language shops.
• Microdata, 2012
Looking Deeper into Job Postings
hiringOrganization: 40% String, 60% Object
Schema.org
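Because `hiringOrganization` appears both as a plain string and as a nested object, a consumer has to normalise the two shapes before comparing postings. A minimal sketch, assuming postings have already been parsed into Python dictionaries; the sample postings and the `organization_name` helper are hypothetical.

```python
def organization_name(value):
    """Return the organisation name whether the property is a string or an object."""
    if isinstance(value, str):
        return value.strip()
    if isinstance(value, dict):
        name = value.get("name")
        return name.strip() if isinstance(name, str) else None
    return None

postings = [
    {"title": "Data Engineer", "hiringOrganization": "ExampleCo"},
    {"title": "Editor",
     "hiringOrganization": {"@type": "Organization", "name": "Example Press"}},
]
names = [organization_name(p.get("hiringOrganization")) for p in postings]
print(names)
```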
WEB DATA COMMONS → GLOBAL DATA SPACE
PRESENT → FUTURE
TAKEAWAYS
• Linked Open Data is a great vision
• The LOD cloud contains lots of data that we CAN consume
• The Common Crawl database lowers the bar for web-scale R&D
• Web Data Commons is a good-quality semantic dataset
• Web Data Commons offers easy access to large amounts of semantic data
CHALLENGES
• LOD is still sparse or at least spotty
• LOD is mostly brittle (not many statistics built in)
• The global data space has only just started forming
• Data integration requires effort and may introduce errors
• Sophisticated Natural Language Processing work is required to get data analyzed and utilized
THANK YOU!
CREDITS: CHRIS BIZER, OLIVER GRISEL, SOREN AUER

DATA STRUCTURE AND ALGORITHM for beginners
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 

The Web of data and web data commons

  • 1. This is a mixed deck with slides from Prof. Dr. Chris Bizer, Oliver Grisel, Soren Auer and Jesse Wang. THE WEB OF DATA
  • 2. AGENDA • Introduction to the Web of (Open Semantic) Data • Linked Open Data and 5-star Data Principles • DBpedia – Query Wikipedia as a database • Linked Data Integration Framework • Common Crawl Database • Web Data Commons • Summary
  • 4. 11/7/11 “To a computer, then, the web is a flat, boring world devoid of meaning” Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
  • 5. “This is a pity, as in fact documents on the web describe real objects and imaginary concepts, and give particular relationships between them” Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
  • 6. “Adding semantics to the web involves two things: allowing documents which have information in machine-readable forms, and allowing links to be created with relationship values.” Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
  • 11. THE WEB OF DATA - HOW? • RDF / Triple Stores / SPARQL: graph stores with dynamic schemas; strong interoperability • JSON-LD: upgrade your JSON with scoped vocabularies; web / mobile / JS developer friendly • RDFa + schema.org & rNews: publish annotations in structured markup; vocabulary understood by search engines
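To make "upgrade your JSON with scoped vocabularies" concrete, here is a minimal sketch of how a JSON-LD `@context` turns ordinary JSON keys into RDF-style triples. The document, the example.org URIs and the helper `to_triples` are hypothetical, and the expansion is deliberately simplified (flat documents with string values only); a real JSON-LD processor follows many more rules.

```python
# Hypothetical JSON-LD document: plain JSON plus an @context that maps
# short keys to vocabulary URIs (schema.org terms used for illustration).
doc = {
    "@context": {
        "name": "http://schema.org/name",
        "homepage": "http://schema.org/url",
    },
    "@id": "http://example.org/people/alice",
    "name": "Alice",
    "homepage": "http://example.org/~alice",
}


def to_triples(document):
    """Expand context-mapped keys into (subject, predicate, object) triples.

    Simplified on purpose: handles only flat documents with string values.
    """
    context = document.get("@context", {})
    subject = document["@id"]
    triples = []
    for key, value in document.items():
        if key.startswith("@"):
            continue  # skip JSON-LD keywords such as @context and @id
        predicate = context.get(key)
        if predicate is None:
            continue  # keys without a mapping carry no RDF meaning here
        triples.append((subject, predicate, value))
    return triples


triples = to_triples(doc)
for s, p, o in triples:
    print(s, p, o)
```

The same triples could be loaded into a triple store and queried with SPARQL, which is the interoperability point the slide makes.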
  • 12. THE WEB OF DATA - WHAT? • Linked Open Data: started with DBpedia (Wikipedia as a database); as of 2011.09, the LOD cloud has nearly 300 datasets • Web Data Commons: based on the Common Crawl database; LOD + OpenGraph + Schema.org • Knowledge bases? Can we be a valuable contributor?
  • 13. LINKED DATA PARADIGM • Use URIs as names for things. • Use HTTP URIs so that people can look up those names. • When someone looks up a URI, provide useful information. • Include links to other URIs so that they can discover more things.
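The four principles can be sketched as a "follow your nose" traversal: look up a URI, read its description, then follow the links it contains. In this sketch the `WEB` dictionary is a hypothetical in-memory stand-in for HTTP lookups, and all example.org URIs and the function name `discover` are invented for illustration.

```python
from collections import deque

# Hypothetical in-memory stand-in for the Web: URI -> description.
# In practice each lookup would be an HTTP GET on the URI.
WEB = {
    "http://example.org/id/berlin": {
        "label": "Berlin",
        "links": ["http://example.org/id/germany"],
    },
    "http://example.org/id/germany": {
        "label": "Germany",
        "links": ["http://example.org/id/europe"],
    },
    "http://example.org/id/europe": {"label": "Europe", "links": []},
}


def discover(start_uri, max_hops=2):
    """Breadth-first traversal over linked URIs, up to max_hops links away."""
    seen = {start_uri}
    pending = deque([(start_uri, 0)])
    labels = []
    while pending:
        uri, hops = pending.popleft()
        description = WEB.get(uri)  # principle 3: a lookup returns useful information
        if description is None:
            continue
        labels.append(description["label"])
        if hops == max_hops:
            continue
        for linked in description["links"]:  # principle 4: links lead to more things
            if linked not in seen:
                seen.add(linked)
                pending.append((linked, hops + 1))
    return labels


print(discover("http://example.org/id/berlin"))
```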
  • 14. 5 ★ OPEN DATA Tim Berners-Lee, inventor of the Web and Linked Data initiator, suggested a 5 star deployment scheme for Open Data. Here, we give examples for each step of the stars and explain costs and benefits that come along with it. http://5stardata.info/
  • 15. AND IT STARTS WITH…
  • 17. DBPEDIA Joint project to • create a huge, multilingual knowledge base • by extracting structured information from Wikipedia • make the knowledge base available on the Web as Linked Data under an open license
  • 18. WE HELPED DBPEDIA (3.5, 2010.4) • Extraction framework completely rewritten • Mapping language redesigned • Hosted on a wiki http://mappings.dbpedia.org • A lot more things extracted • … [Chart: Total Triples, DBpedia 3.4 vs. DBpedia 3.5]
  • 21. DBPEDIA 3.8 (NOW) • Structured Information in Wikipedia • infoboxes • geo-coordinates • categorization of articles • inter-language links • links to images and external webpages • titles and abstracts • tables and lists • Currently 111 localized editions
  • 22.
      Category          Instances   Statements  Distinct Properties
      Person              871,630   18,323,794            6,195,234
      Artist              100,793    3,723,440              998,616
      Actor                25,340    1,070,066              247,690
      Musical Artist       46,364    2,069,152              550,225
      Athlete             217,067    6,373,136            1,853,233
      Politician           41,126    1,407,548              454,209
      Place               643,260   24,698,893            8,026,305
      Building             65,355    1,058,610              530,010
      Airport              11,675      352,377              138,944
      Bridge                3,425       66,968               34,470
      Skyscraper               68        3,091                  719
      Populated Place     424,291   20,565,679            6,212,991
      River                26,892      681,782              208,146
      Organisation        206,670    4,940,190            2,029,620
      Band                 29,101    1,126,744              298,743
      Company              48,989    1,048,251              445,758
      Educ.Institution     43,250      958,257              493,792
      Work                360,808    9,649,228            3,566,511
      Book                 44,339    1,111,960              408,724
      Film                 75,067    2,663,487              787,129
      Musical Work        160,383    4,116,625            1,635,655
      Album               122,729    3,400,942            1,224,746
      Single               42,393    1,226,636              534,023
      Software             28,930      731,138              242,411
      Television Show      24,784      565,136              282,594
      [Chart: Instances, Statements and Distinct Properties per category]
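As an illustration of "query Wikipedia as a database", the sketch below builds a SPARQL request URL for one of the classes listed above. It assumes the public DBpedia endpoint at https://dbpedia.org/sparql and the `dbo:Skyscraper` class from the DBpedia ontology; the `format` parameter is an endpoint-specific convenience rather than part of the SPARQL protocol, and the helper `build_request_url` is hypothetical. No request is actually sent.

```python
from urllib.parse import urlencode

# Assumption: the public DBpedia SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"

# Ask for up to ten skyscrapers together with their English labels.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?building ?label WHERE {
  ?building a dbo:Skyscraper ;
            rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
LIMIT 10
"""


def build_request_url(endpoint, query):
    """Return a GET URL carrying the query as a URL-encoded parameter."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Fetching this URL with any HTTP client should return a JSON result set, assuming the endpoint is reachable and still accepts these parameters.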
  • 24. CONSUMING LINKED DATA Browsers: • LOD Cloud http://datahub.io • Tabulator • Disco • Linked Open Data Explorer • Marbles • ObjectViewer Search Engines: • Sameas.org • Sindice • Sig.ma • LOD Cache (Virtuoso by OpenLink Software) • SWSE - DERI • VisiNav • Falcon • Swoogle
  • 25. LDIF – LINKED DATA INTEGRATION FRAMEWORK • Single Machine / Hadoop Version • tested with 3.6 billion RDF quads
  • 27. LEARNING LINKAGE RULES USING GENETIC PROGRAMMING • based on existing reference links • GenLink learns • comparisons • aggregations • transformations • weights • instead of subtree crossover, we use a set of custom crossover operators [Figures: Aggregation Crossover, Transformation Crossover]
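To show what the four learned ingredients look like together, here is a hand-written linkage rule with transformations, comparisons, a weighted aggregation and weights. This is only an illustrative sketch, not GenLink's actual rule representation: the attribute names, the weights, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions.

```python
from difflib import SequenceMatcher


def lowercase(value):
    """Transformation: normalise case and surrounding whitespace."""
    return value.lower().strip()


def similarity(a, b):
    """Comparison: string similarity in [0, 1] (stand-in for a learned measure)."""
    return SequenceMatcher(None, a, b).ratio()


# A rule is a list of (attribute, transformation, weight) entries.
# GenLink would learn these; here they are fixed by hand for illustration.
RULE = [
    ("title", lowercase, 3.0),
    ("author", lowercase, 1.0),
]


def score(record_a, record_b, rule=RULE):
    """Aggregation: weighted average of the per-attribute similarities."""
    total_weight = sum(weight for _, _, weight in rule)
    weighted = sum(
        weight * similarity(transform(record_a[attr]), transform(record_b[attr]))
        for attr, transform, weight in rule
    )
    return weighted / total_weight


a = {"title": "Learning Linkage Rules", "author": "R. Isele"}
b = {"title": "learning linkage rules ", "author": "r. isele"}
print(score(a, b))
```

Two records would be linked when the score exceeds a chosen threshold; a genetic algorithm would search over the rule structure instead of fixing it as done here.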
  • 28. RESULTS FOR THE CORA EVALUATION DATA SET • Citations to research papers from the Cora research paper search engine • Attributes: Title, Author, Venue, Date of publication • Reference Links: 1600 • GenLink achieved an F-measure of 96.6% against the validation set. • Carvalho et al. report an F-measure of 91.0% against the validation set (last line).
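For readers unfamiliar with the metric, the F-measure is commonly computed as the harmonic mean of precision and recall over the generated links. The sketch below assumes this standard (F1) definition; the counts in the usage line are hypothetical and are not taken from the Cora experiment.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall (the F1 score)."""
    predicted = true_positives + false_positives  # links the rule generated
    actual = true_positives + false_negatives     # links in the reference set
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
print(f_measure(true_positives=90, false_positives=10, false_negatives=20))
```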
  • 29. LEARNED RULE Robert Isele and Christian Bizer: Learning Expressive Linkage Rules using Genetic Programming. PVLDB 5(11):1638-1649, 2012
  • 30. ACTIVE LEARNING OF LINKAGE RULES • Query Strategy: Select the link candidate for which the linkage rules in the current population disagree the most.
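The query strategy can be sketched as follows: every rule in the population votes on each link candidate, and the candidate with the most evenly split vote is shown to the user for labelling. The disagreement measure, the threshold rules and the candidate scores below are assumptions chosen for illustration; the original work may quantify disagreement differently.

```python
def disagreement(candidate, rules):
    """Return 1.0 for a 50/50 vote split and 0.0 for a unanimous vote."""
    votes = sum(1 for rule in rules if rule(candidate))
    share = votes / len(rules)
    return 1.0 - abs(2.0 * share - 1.0)


def select_query(candidates, rules):
    """Pick the link candidate on which the rule population disagrees most."""
    return max(candidates, key=lambda candidate: disagreement(candidate, rules))


# Hypothetical population: each rule accepts a candidate whose similarity
# score reaches its own threshold.
thresholds = [0.3, 0.5, 0.6, 0.7]
rules = [lambda score, t=t: score >= t for t in thresholds]

# Hypothetical link candidates, represented only by a similarity score.
candidates = [0.2, 0.55, 0.9]
print(select_query(candidates, rules))
```

Labelling the most contested candidate is what lets each answer rule out as many competing rules as possible.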
  • 31. STRUCTURED DATA ON THE WEB WE HAVE THE TOOLS NOW
  • 32. HTML-EMBEDDED STRUCTURED DATA ON THE WEB More and more websites semantically mark up the content of their HTML pages. • Microformats • Microdata • RDFa
  • 33. MICROFORMATS • Microformat effort dates back to 2003 • Small set of fixed formats • hCard: people, companies, organizations, and places • XFN: relationships between people • hCalendar: calendaring and events • hListing: small-ads; classifieds • hReview: reviews of products, businesses, events • Shortcoming of Microformats • cannot represent arbitrary kinds of data • indexed by Google and Yahoo since 2009
  • 34. RDFA • serialization format for embedding RDF data into HTML pages • proposed in 2004, W3C Recommendation in 2008 • can be used together with any vocabulary • can assign URIs as global primary keys to entities
  • 35. OPEN GRAPH PROTOCOL • allows site owners to determine how entities are described in Facebook • relies on RDFa for encoding data in HTML pages • available since April 2010
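Open Graph descriptions are carried in `<meta>` elements whose `property` attribute starts with `og:`. A minimal sketch of reading them with Python's standard `html.parser` module follows; the HTML snippet and the class name `OpenGraphCollector` are hypothetical.

```python
from html.parser import HTMLParser


class OpenGraphCollector(HTMLParser):
    """Collect og:* properties from <meta property="..." content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        if prop.startswith("og:") and "content" in attributes:
            self.properties[prop] = attributes["content"]


# Hypothetical page head, for illustration only.
PAGE = """
<html><head>
<meta property="og:title" content="Example Movie">
<meta property="og:type" content="video.movie">
<meta name="description" content="not an Open Graph property">
</head></html>
"""

og_parser = OpenGraphCollector()
og_parser.feed(PAGE)
print(og_parser.properties)
```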
  • 36. MICRODATA • alternative technique for embedding structured data • proposed in 2009 by WHATWG as part of HTML5 work • tries to be simpler than RDFa (5 new attributes instead of 8) • W3C currently tries to reconcile the two alternative proposals
  • 37. SCHEMA.ORG • asks site owners to embed data to enrich search results • 200+ Types: Event, Organization, Person, Place, Product, Review • Encoding: Microdata or alternatively RDFa
  • 38. USAGE OF SCHEMA.ORG DATA @ GOOGLE • Answers to fact queries • Data snippets within search results • Data tables within search results
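A rough sketch of what extracting schema.org Microdata involves: find elements carrying an `itemprop` attribute and record their text. The snippet and the class name `ItempropCollector` are hypothetical, and the parser ignores nesting, `itemscope` boundaries and attribute-valued properties (such as `href` or `content`), all of which a real extractor has to handle.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collect (property, text) pairs from elements with an itemprop attribute."""

    def __init__(self):
        super().__init__()
        self.properties = []
        self._open_prop = None  # itemprop name awaiting its text content

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemprop" in attributes:
            self._open_prop = attributes["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self._open_prop and text:
            self.properties.append((self._open_prop, text))
            self._open_prop = None


# Hypothetical product markup, for illustration only.
SNIPPET = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Espresso Machine</span>
  <span itemprop="brand">Acme</span>
</div>
"""

microdata_parser = ItempropCollector()
microdata_parser.feed(SNIPPET)
print(microdata_parser.properties)
```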
  • 40. THE COMMON CRAWL CORPORA • Provides two web corpora on Amazon S3 • 2009/2010 Corpus: 2.5 billion HTML pages • June 2012 Corpus: 3.0 billion HTML pages • The June 2012 Corpus • unique HTML pages: 3,005,629,093 • pay-level-domains (PLDs): 40.6 million • size of the corpus in compressed form: 48 terabytes • Crawler uses PageRank to decide which pages to retrieve • snapshot of the popular part of the Web • number of pages per site varies widely • youtube.com: 93.1 million pages • 37.5 million PLDs with less than 100 pages
  • 42. WEB DATA COMMONS • WebDataCommons.org Project • extracts all Microformat, Microdata, RDFa data from the Common Crawl • provides the extracted data for free download • Two extraction runs • 2009/2010 CC Corpus: 2.5 billion HTML pages → 5.1 billion RDF triples • 2012 CC Corpus: 3.0 billion HTML pages → 7.3 billion RDF triples • Joint project of
  • 43. THE WDC EXTRACTION FRAMEWORK • 700,000 input files queued in SQS • EC2 workers take tasks from SQS • Workers read and write S3 buckets [Diagram: Common Crawl (CC) files in S3, task queue in SQS, EC2 workers writing results to WDC] • 100 spot instances of type c1.xlarge (7 GB RAM, 8 cores) • 5,600 machine hours • 398 US$
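The queue-and-workers pattern on this slide can be sketched locally with threads standing in for EC2 workers and `queue.Queue` standing in for SQS. The `extract` function is a placeholder, and the file names and worker count are hypothetical; only the final cost figures come from the slide.

```python
import queue
import threading


def extract(file_name):
    """Placeholder for the per-file extraction step (assumption)."""
    return f"{file_name}.triples"


def run_workers(file_names, worker_count=4):
    """Queue all input files, then let workers drain the queue in parallel."""
    tasks = queue.Queue()  # stands in for the SQS task queue
    for name in file_names:
        tasks.put(name)

    results = []  # stands in for the output S3 bucket
    lock = threading.Lock()

    def worker():
        while True:
            try:
                name = tasks.get_nowait()
            except queue.Empty:
                return  # no tasks left: the worker shuts down
            output = extract(name)
            with lock:
                results.append(output)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(results)


print(run_workers(["cc-0001", "cc-0002", "cc-0003"]))

# Cost per machine hour implied by the figures on the slide.
cost_per_machine_hour = 398 / 5600
print(round(cost_per_machine_hour, 3), "US$ per machine hour")
```

Because each task is independent, adding workers shortens the run without changing the total machine hours, which is why cheap spot instances fit this workload.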
  • 44. WEBSITES CONTAINING STRUCTURED DATA (CC 2012) • 2.29 million websites (PLDs) out of 40.6 million provide Microformat, Microdata or RDFa data (5.65%). • 369 million of the 3 billion pages contain Microformat, Microdata or RDFa data (12.3%).
  • 45. POPULARITY OF WEBSITES CONTAINING STRUCTURED DATA • Grouped by Alexa Website Popularity Rank (site rank based on number of page views)
  • 46. BREAKDOWN BY ENCODING FORMAT (CC 2012)
  • 47. DISTRIBUTION BY TOP LEVEL DOMAIN
  • 48. RDFA TOPICS (CC 2012) • Top Classes: • Topics • CMS and Blog metadata • Product data • Ratings • Navigational metadata • Company listings
  • 49. MICRODATA TOPICS (CC 2012) • Top Classes: • Topics • CMS and Blog metadata • Navigational metadata • Products and offers • Business listings • Ratings • Places • Events • datavoc = Google's Rich Snippet Vocabulary
  • 50. CLASS / PROPERTY DISTRIBUTION • Only a small set of classes / properties is used. • Heterogeneity at the schema level is easy to overcome.
  • 51. MICROFORMATS • Top Classes: • Topics • Persons • Organisations • Events • Listings and Reviews • Recipes
  • 52. LOOKING DEEPER INTO THE E-COMMERCE DATA • Microdata, 2012
  • 53. SHOPS BY PRODUCT CATEGORY • Classifier trained for 9 product categories on descriptions from Amazon. • Examined 9000 English-language shops.
  • 54. LOOKING DEEPER INTO JOB POSTINGS • Microdata, 2012 • Schema.org hiringOrganization: 40% String, 60% Object
  • 55. WEB COMMON DATA → GLOBAL DATA SPACE • PRESENT → FUTURE
  • 56. TAKEAWAYS • Linked Open Data is a great vision • LOD cloud contains lots of data that we CAN consume • The Common Crawl database lowers the bar for web-scale R&D • Web Data Commons is a good-quality semantic dataset • Web Data Commons offers opportunities for easy access to large amounts of semantic data
  • 57. CHALLENGES • LOD is still sparse or at least spotty • LOD is mostly brittle (not many statistics built in) • The global data space has only just started forming • Data integration requires effort and may introduce errors • Sophisticated Natural Language Processing work is required to get data analyzed and utilized
  • 58. THANK YOU! CREDITS: CHRIS BIZER, OLIVER GRISEL, SOREN AUER