Today, data about any domain can be found on the web in data repositories, web APIs and many millions of spreadsheets and CSV files. Researchers and organizations make these data available in a myriad of formats, layouts, terminology and cleanliness that make it difficult to integrate together. As a result, researchers aiming to use data in their analyses face three main challenges. The first one is finding datasets related to a feature, variable or topic of interest. For example, climate scientists need to look for years of observational data from authoritative sources when estimating the climate of a region. The second challenge is completing a given dataset with existing knowledge: machine learning applications are data hungry and require as many data points and features as possible to improve their predictions, which often requires integrating data from different sources. The third challenge is sharing integrated results: once several datasets have been merged together, how to make them available to the rest of the community?
WDPlus: Leveraging Wikidata to Link and Extend Tabular Data
1. WDPlus:
Leveraging Wikidata to Link and
Extend Tabular Data
Daniel Garijo, Pedro Szekely
Information Sciences Institute and
Department of Computer Science
@dgarijov
dgarijo@isi.edu
2. Abundance of data sources in the Web
Users of data face three challenges
• How do I find relevant datasets
for my problem?
• How do I augment my dataset
with existing information?
• How can I share my integrated
results with the community?
Daniel Garijo and Pedro Szekely. WDPlus: Leveraging Wikidata to Link and Extend
Tabular Data. AKTS 2019 (eScience 2019)
2
Raw data
3. Popular initiatives for addressing these challenges
• Search individual items
• Search is manual, based on user input
• LOD cloud of connected datasets
• Knowledge engineers are needed to map and
augment content
• ETL Frameworks (e.g, Karma, Open Refine)
• Pipelines are custom, expertise required
• Often not shared
Daniel Garijo and Pedro Szekely. WDPlus: Leveraging Wikidata to Link and Extend
Tabular Data. AKTS 2019 (eScience 2019)
3
Sources: https://lod-cloud.net/versions/2019-03-29/lod-cloud.png; https://panoply.io/data-warehouse-guide/3-ways-to-build-an-etl-process/
4. WDPlus
A framework designed to:
• Discover data on the Web
• Improve raw data to make it useful
• Search, querying dataset structure
• Download fresh data
• Combine existing dataset
• Share improved data and methods
Daniel Garijo and Pedro Szekely. WDPlus: Leveraging Wikidata to Link and Extend
Tabular Data. AKTS 2019 (eScience 2019)
4
1953-01-20
Harry Truman
Lamar
President
USA
1884-05-08
1972-12-26
male
1945-04-12
Bress Truman
weath
er
shoppi
ng
crime
sports
Core
Satellites
Metadata index
5. WDPlus architecture
Wikidata as a core KG
• 60 Million items
• 700 Million statements
• 20,000 + contributors
• +1 billion edits
• Collaborative!
5
1953-01-20
Harry Truman
Lamar
President
USA
1884-05-08
1972-12-26
male
1945-04-12
Bress Truman
Daniel Garijo and Pedro Szekely. WDPlus: Leveraging Wikidata to Link and Extend
Tabular Data. AKTS 2019 (eScience 2019)
Core
6. WDPlus architecture
Satellite organization
• Detailed information on a domain
• Crime records, sport events, etc.
• Linked to the Wikidata core
• Link first strategy
• Custom properties and Qnodes
• Extensions to core model
• Synchronized with core
• Decentralized
• 1 satellite may be maintained by 1
community
6Daniel Garijo and Pedro Szekely. WDPlus: Leveraging Wikidata to Link and Extend
Tabular Data. AKTS 2019 (eScience 2019)
1953-01-20
Harry Truman
Lamar
President
USA
1884-05-08
1972-12-26
male
1945-04-12
Bress Truman
weath
er
shoppi
ng
crime
sports
Core
Satellites
7. WDPlus architecture
Table models
• Tables are not materialized
• Able to become a satellite under
demand
• Described in machine-readable
metadata index
• Indexing columns names and
relevant instances for fast retrieval
• Link to table model is preserved
7Daniel Garijo and Pedro Szekely. WDPlus: Leveraging Wikidata to Link and Extend
Tabular Data. AKTS 2019 (eScience 2019)
1953-01-20
Harry Truman
Lamar
President
USA
1884-05-08
1972-12-26
male
1945-04-12
Bress Truman
weath
er
shoppi
ng
crime
sports
Core
Satellites
Metadata index
8. Towards WDPLus
8Daniel Garijo and Pedro Szekely. WDPlus: Leveraging Wikidata to Link and Extend
Tabular Data. AKTS 2019 (eScience 2019)
1953-01-20
Harry Truman
Lamar
President
USA
1884-05-08
1972-12-26
male
1945-04-12
Bress Truman
weath
er
shoppi
ng
crime
sports
Core
Satellites
Metadata index
9. WDPlus framework: Metadata index and
table Augmentation
• Search
• Keywords, variables or content
• Wikifier may be used in search
• Download
• Download a dataset or its metadata
• Augment
• Merge your dataset with contents from
other datasets automatically
• Upload
• Add new datasets (automated metadata
profiling and provenance)
• Enrich
• Header enrichment for search efficiency
9Daniel Garijo and Pedro Szekely. WDPlus: Leveraging Wikidata to Link and Extend
Tabular Data. AKTS 2019 (eScience 2019)
10. WDPlus framework: T2WML
10Daniel Garijo and Pedro Szekely. WDPlus: Leveraging Wikidata to Link and Extend
Tabular Data. AKTS 2019 (eScience 2019)
Table overview
Cell-based mapping.
This mapping is
saved in WDPlus for
future reference
Entity Linking
Result sample
Easy to share!
11. Creating Wikidata Satellites: Challenges
• Identify new properties to model satellites
• Currently done by hand by Knowledge engineers
• Creation of new Qnodes for satellite instances
• Identified a schema for each satellite
• Feedback loop to Wikidata
• How to select a “trusty” statement when several values are available?
• Namespace issues
• Single namespace, or namespace per satellite?
• Inter-satellite linkages
11Daniel Garijo and Pedro Szekely. WDPlus: Leveraging Wikidata to Link and Extend
Tabular Data. AKTS 2019 (eScience 2019)
12. Conclusions
• Tabular data exists in heterogeneous formats
• Difficult to find, use, augment and share
• WDPlus is a framework to help discover, improve, search, augment,
combine and share tabular data
• WDPlus framework for profiling and enriching datasets
• T2WML language to generate linked instances from tabular data
• Encouraging early results on usability
12Daniel Garijo and Pedro Szekely. WDPlus: Leveraging Wikidata to Link and Extend
Tabular Data. AKTS 2019 (eScience 2019)
13. Help us extend WDPlus!
Do you have comments, suggestions or use cases? Contact me at:
Daniel Garijo and Pedro Szekely. WDPlus: Leveraging Wikidata to Link and Extend
Tabular Data. AKTS 2019 (eScience 2019)
13
dgarijo@isi.edu