Introduction
Problem and
Motivation
Approach
Resource
Instance and
Type Extraction
Resource
Sampling
Approaches
Constructing
profiles:
Dataset-topic
graph
Topic Ranking
Approaches
Experimental
Setup
Baselines
Evaluation
Results
Efficiency of
Dataset Profiling
Scalability of
Dataset Profiling
Conclusions
A Scalable Approach for Efficiently Generating
Structured Dataset Topic Profiles
Besnik Fetahu¹, Stefan Dietze¹, Bernardo Pereira Nunes², Marco Antonio Casanova², Davide Taibi³, Wolfgang Nejdl¹
¹ L3S Research Center, Leibniz Universität Hannover
² Department of Informatics, PUC-Rio
³ Institute for Educational Technologies, CNR
May 29, 2014
1 Introduction
2 Problem and Motivation
3 Approach
Resource Instance and Type Extraction
Resource Sampling Approaches
Constructing profiles: Dataset-topic graph
Topic Ranking Approaches
4 Experimental Setup
Baselines
5 Evaluation Results
Efficiency of Dataset Profiling
Scalability of Dataset Profiling
6 Conclusions
Introduction
• Increasing amount of Web Data
• Data heterogeneity: representation, language, quality and
domains
• Sparsely connected datasets
• Lack of descriptive metadata about datasets
• Exhaustive techniques for data analysis
• Efficiency heavily dependent on information need
• Ease of access and representation of datasets
Why dataset profiling?
• Growing number of datasets: 227 datasets
• Data represented as triples: 31 billion triples
• Multi-lingual content: 18 languages
• Broad set of topics covered
• Inter-dataset links
Domain            # Datasets   Triples
Media             25            1,841,852,061
Geographic        31            6,145,532,484
Government        49           13,315,009,400
Publications      87            2,950,720,693
Cross-domain      41            4,184,635,715
Life sciences     41            3,036,336,004
User-generated    20              134,127,413
Why dataset profiling?
Find datasets covering the domain of “Renewable Energy”?
• Sparsity: Which datasets cover the topic?
  • Only 38 out of 228 datasets contain topic coverage information.
• Scalability: Use a SPARQL FILTER clause?
  • A regex(*) filter clause needs to check all triples to find those that contain a specific keyword.
• Ambiguity: What are all the possible forms of renewable energy?
  • solar energy, wind energy, geothermal, ...
Profiling Overview
1 Metadata extraction
2 Resource sampling
3 Entity/topic extraction
4 Profile graphs
5 Profiles representation
Dataset Profiling Example
Resource Instance and Type Extraction
• Simple SPARQL SELECT queries
• Average indexing time per dataset: about 7 min for 10% of resources vs. about 4 h for 100%
• Approximately 300 million resource instances
[Figure: log-scale indexing time per dataset (e.g. uriburner, l3s-dblp, hellenic-fire-brigade), comparing 100% vs. 10% of resources.]
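The slides do not show the SELECT queries themselves, so the following is only a minimal sketch of what a paged query for resource instances and their types might look like. The function name, the variable names `?resource` and `?type`, and the paging scheme are assumptions made for illustration, not the authors' implementation.

```python
def build_instance_type_query(limit: int, offset: int = 0) -> str:
    """Return a SPARQL SELECT query for one page of (resource, type) pairs.

    Hypothetical helper: the paging via LIMIT/OFFSET is an assumption,
    chosen because public endpoints usually cap result sizes.
    """
    if limit <= 0 or offset < 0:
        raise ValueError("limit must be positive and offset non-negative")
    return (
        "SELECT ?resource ?type\n"
        "WHERE { ?resource a ?type }\n"
        # ORDER BY keeps pages stable across successive requests.
        "ORDER BY ?resource ?type\n"
        f"LIMIT {limit} OFFSET {offset}"
    )


# Third page of 1000 rows each.
query = build_instance_type_query(limit=1000, offset=2000)
```

Sorting before paging costs the endpoint extra work, which may matter at the dataset sizes reported above; dropping the ORDER BY line trades page stability for speed.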
Resource Sampling Approaches
Entity and Topic Extraction
Resource Sampling
• random: randomly select a resource instance for analysis
• weighted: weigh a resource by the number of datatype properties used to describe it, $w_k = |f(r_k)| / \max_j |f(r_j)|$
• centrality: weigh a resource by the number of types used to describe it, $c_k = |C_k| / |C|$
Topic Extraction
• Treat resources as documents by combining all their textual literals
• Perform NED¹ (named entity disambiguation) and extract the corresponding DBpedia entities
• Extract topics as DBpedia categories from entities via dcterms:subject
¹ DBpedia Spotlight
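The two weighting formulas can be illustrated with a short sketch. It assumes that f(r) is the set of datatype properties of a resource and that C is the set of all types occurring in the dataset. The slides do not say how the weights drive the actual draw, so the weighted sampling step (random keys raised to 1/weight, without replacement) is a swapped-in, generic technique, and all function names are hypothetical.

```python
import random


def weighted_scores(properties_by_resource):
    """w_k = |f(r_k)| / max_j |f(r_j)| for each resource r_k."""
    largest = max(len(p) for p in properties_by_resource.values()) or 1
    return {r: len(p) / largest for r, p in properties_by_resource.items()}


def centrality_scores(types_by_resource):
    """c_k = |C_k| / |C|, with C assumed to be all types seen in the dataset."""
    all_types = set().union(*types_by_resource.values())
    total = len(all_types) or 1
    return {r: len(t) / total for r, t in types_by_resource.items()}


def sample_resources(scores, fraction, seed=0):
    """Draw a fraction of resources without replacement, biased by score.

    Assumption: each resource gets the key u ** (1 / score) with u uniform
    in [0, 1); the resources with the largest keys are kept.
    """
    rng = random.Random(seed)
    size = max(1, round(fraction * len(scores)))
    keys = {
        r: rng.random() ** (1.0 / s) if s > 0 else 0.0
        for r, s in scores.items()
    }
    return sorted(keys, key=keys.get, reverse=True)[:size]


props = {"r1": {"title", "abstract", "date"}, "r2": {"title"}}
types = {"r1": {"Paper"}, "r2": {"Paper", "Event"}}
w = weighted_scores(props)        # r1 -> 1.0, r2 -> 1/3
c = centrality_scores(types)      # r1 -> 0.5, r2 -> 1.0
picked = sample_resources(w, fraction=0.5)
```

The random strategy corresponds to giving every resource the same score.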
Constructing profiles: Dataset-topic graph
1 Profile graph nodes: datasets, resources, topics
2 Weighted graph edges: $\Delta(D, t)$
3 Edge weights: $\Delta(D_i, t) = \Delta(D_j, t)$
4 Compute $\Delta(D_i, t)$ by assessing the importance of t given the resources of $D_i$ as prior knowledge
5 The given prior knowledge biases the importance of t in the profile graph towards $D_i$ ²
6 Incrementally add datasets to the profile graph by simply computing the weights $\Delta(D_k, t)$
² Scott White and Padhraic Smyth. 2003. Algorithms for estimating relative importance in networks. In 9th ACM SIGKDD (KDD ’03).
Topic Ranking Approaches
Topic filtering
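Topic pre-filtering:
$$\mathrm{NTR}(t, D) = \frac{\Phi(\cdot, D)}{\Phi(t, D)} + \frac{\Phi(\cdot, \cdot)}{\Phi(t, \cdot)}$$
• Filter noisy topics
• $\Phi(t, D)$: number of entities associated with topic t in dataset D; a dot (·) in place of an argument aggregates over all topics or all datasets
• Closely related to the tf-idf weighting scheme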
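As a worked illustration of the NTR score, the sketch below counts entities from a nested dictionary. The data layout (dataset to topic to set of entities), the function name, and the choice to count entities per dataset when aggregating across datasets are assumptions; the slides do not specify them, nor how the score is thresholded for filtering.

```python
def ntr(topic, dataset, entities_by_topic):
    """NTR(t, D) = Phi(., D) / Phi(t, D) + Phi(., .) / Phi(t, .).

    entities_by_topic: dict dataset -> dict topic -> set of entity ids
    (hypothetical layout chosen for this example).
    """
    topics_in_d = entities_by_topic[dataset]
    phi_t_d = len(topics_in_d.get(topic, set()))
    if phi_t_d == 0:
        raise ValueError("topic has no entities in this dataset")
    # Phi(., D): distinct entities of D over all topics.
    phi_dot_d = len(set().union(*topics_in_d.values()))
    # Phi(t, .): entities linked to the topic, summed over datasets.
    phi_t_dot = sum(
        len(topics.get(topic, set())) for topics in entities_by_topic.values()
    )
    # Phi(., .): distinct entities per dataset, summed over datasets.
    phi_dot_dot = sum(
        len(set().union(*topics.values()))
        for topics in entities_by_topic.values()
    )
    return phi_dot_d / phi_t_d + phi_dot_dot / phi_t_dot


data = {
    "D1": {"Energy": {"e1", "e2"}, "Music": {"e3"}},
    "D2": {"Energy": {"e4"}},
}
energy_score = ntr("Energy", "D1", data)  # 3/2 + 4/3
music_score = ntr("Music", "D1", data)    # 3/1 + 4/1
```

In this toy data the rarer topic receives the larger score, which mirrors the inverse-frequency behaviour of tf-idf.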
Topic Ranking
• PageRank with Priors (PRankP)
• HITS with Priors (HITSP)
• K-Step Markov (KStepM)
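To make the idea of ranking with priors concrete, here is a minimal sketch of PageRank with priors on a tiny dataset-resource-topic graph, in the usual formulation where a random walk returns to the prior nodes with probability beta. The toy graph, the value beta = 0.3, the iteration count, and the function names are assumptions for illustration, not the configuration used in the experiments.

```python
def undirected(pairs):
    """Build an adjacency list with both directions for each edge."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    return graph


def pagerank_with_priors(graph, priors, beta=0.3, iterations=50):
    """Iterate rank(v) = beta * prior(v) + (1 - beta) * incoming rank."""
    nodes = set(graph) | {v for outs in graph.values() for v in outs}
    total = sum(priors.values())
    prior = {n: priors.get(n, 0.0) / total for n in nodes}
    rank = dict(prior)
    for _ in range(iterations):
        nxt = {n: beta * prior[n] for n in nodes}
        for u in nodes:
            outs = graph.get(u, [])
            if not outs:
                # Dangling node: hand its mass back to the prior nodes.
                for n in nodes:
                    nxt[n] += (1 - beta) * rank[u] * prior[n]
                continue
            share = (1 - beta) * rank[u] / len(outs)
            for v in outs:
                nxt[v] += share
        rank = nxt
    return rank


graph = undirected([
    ("D1", "r1"), ("r1", "tA"), ("r1", "tShared"),
    ("D2", "r2"), ("r2", "tShared"), ("r2", "tB"),
])
# Resources of D1 act as prior knowledge, biasing topic scores towards D1.
ranks = pagerank_with_priors(graph, priors={"r1": 1.0})
```

With the prior placed on the resources of D1, topic tA, which is reachable only through D1's resource, ends up ranked above tB.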
Experimental Setup
Datasets and Ground-truth
• 129 datasets from lod-cloud³
• 6 ground-truth datasets with manually assigned topic indicators for their resources
Dataset                  Properties                                                            #Resources
yovisto                  skos:subject, dbp:{subject, class, discipline, kategorie, tagline}    62879
oxpoints                 dcterms:subject, dc:subject                                           37258
socialsemweb-thesaurus   skos:subject, tag:associatedTag, dcterms:subject                      2243
semantic-web-dog-food    dcterms:subject, dc:subject                                           20145
lak-dataset              dcterms:subject, dc:subject                                           1691
Evaluation Metrics
• NDCG@k (k = 1, ..., 1000)
• Compare the ranking induced by the graphical models against the ideal ranking
³ At the time of experimentation only 129 dataset endpoints were responsive.
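For readers unfamiliar with the metric, NDCG@k can be sketched as follows. This uses the common log2 discount and derives the ideal ranking by sorting the same relevance list, which assumes the list contains every relevant topic; the exact gain and discount used in the evaluation are not stated in the slides.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of the topics in the order a ranking approach returned them.
score = ndcg_at_k([1, 0, 1], k=3)
```

A ranking that already lists topics in decreasing relevance scores exactly 1.0.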
Baselines
• tf-idf: Consider resources as documents. For each dataset, extract the top {50, 100, 150, 200} terms.
• LDA: Consider datasets as documents⁴. Extract the top weighted topic terms. For every dataset, extract the top {50, 100, 150, 200} terms, with the number of topics set to {10, 20, 30, 40, 50}.
⁴ In this case it does not matter if datasets are considered at the resource level or aggregated.
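The tf-idf baseline can be sketched in a few lines. The slides do not say how per-resource scores are combined into a per-dataset term list, so summing tf-idf over a dataset's resources is an assumption here, as are the whitespace tokenizer and the function name.

```python
import math
from collections import Counter


def top_tfidf_terms(datasets, k):
    """Return the k highest-scoring terms per dataset.

    datasets: dict dataset name -> list of resource texts, where each
    resource is treated as one document.
    """
    docs = [
        (name, text.lower().split())
        for name, texts in datasets.items()
        for text in texts
    ]
    n_docs = len(docs)
    # Document frequency: number of resources containing each term.
    df = Counter(term for _, tokens in docs for term in set(tokens))
    scores = {name: Counter() for name in datasets}
    for name, tokens in docs:
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            scores[name][term] += tf * math.log(n_docs / df[term])
    return {
        name: [term for term, _ in counter.most_common(k)]
        for name, counter in scores.items()
    }


datasets = {
    "energy": ["solar energy data", "wind energy data"],
    "music": ["jazz album data"],
}
top_terms = top_tfidf_terms(datasets, k=2)
```

The term "data" occurs in every resource, so its idf is zero and it never reaches the top of either list.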
Efficiency of Dataset Profiling
[Figure: Profiling accuracy for all topic ranking approaches. x-axis: NDCG rank (1 to 1000); y-axis: NDCG ranking score. Series: K-Step Markov + NTR, PageRank with priors + NTR, HITS with priors + NTR, LDA, tf-idf.]
[Figure: K-Step Markov profiling accuracy (Centrality Sampling). x-axis: sample size (5 to 100). Series: KStepM + NTR, KStepM.]
Scalability of Dataset Profiling
[Figure: Time Performance vs. Profiling Accuracy. x-axis: sample size (0 to 100); y-axes: log-scale time performance and NDCG ranking score. Series: time and ranking for HITS with priors, K-Step Markov, PageRank with priors.]
• Samples of 5% and 10% of resources already provide stable profiling accuracy
• On average, 7 min to index 10% of the resources of a dataset vs. 4 h for the full dataset
• 2 min to rank dataset profiles with 10% of resources vs. 45 min with 100%
• NED runtime for 10% vs. 100%?
Motivation Example Revisited!
Conclusions and Future Work
• Structured dataset profiles
• Scalable approach through sampling
• Efficient profiling through topic filtering and ranking
• Incremental generation of dataset profiles
• Dataset profiles as a set of links (entity and topic links)
• Provenance information of links (e.g. resources from
which an entity is extracted)
• Profiles for dataset recommendation, search, etc.
Resources
• Profiles Endpoint:
http://data-observatory.org/lod-profiles/sparql
• Profiles Webpage:
http://data-observatory.org/lod-profiles/
Thank you! Questions?
#eswc2014Fetahu

 
Preservation of 3 d objects of buildings
Preservation of 3 d objects of buildingsPreservation of 3 d objects of buildings
Preservation of 3 d objects of buildingsnetsoxx
 

Destacado (15)

Towards preservation of semantically enriched architectural knowledge
Towards preservation of semantically enriched architectural knowledgeTowards preservation of semantically enriched architectural knowledge
Towards preservation of semantically enriched architectural knowledge
 
Presentation nokobit
Presentation nokobitPresentation nokobit
Presentation nokobit
 
What's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked DatasetsWhat's all the data about? - Linking and Profiling of Linked Datasets
What's all the data about? - Linking and Profiling of Linked Datasets
 
Quality criteria for architectural 3D data in usage and preservation processes
Quality criteria for architectural 3D data in usage and preservation processesQuality criteria for architectural 3D data in usage and preservation processes
Quality criteria for architectural 3D data in usage and preservation processes
 
KnowEscape workshop, OKCon 2013
KnowEscape workshop, OKCon 2013KnowEscape workshop, OKCon 2013
KnowEscape workshop, OKCon 2013
 
DURAARK at AUdS 2015
DURAARK at AUdS 2015DURAARK at AUdS 2015
DURAARK at AUdS 2015
 
Grapp2014 presentation
Grapp2014 presentationGrapp2014 presentation
Grapp2014 presentation
 
Turning Data into Knowledge (KESW2014 Keynote)
Turning Data into Knowledge (KESW2014 Keynote)Turning Data into Knowledge (KESW2014 Keynote)
Turning Data into Knowledge (KESW2014 Keynote)
 
DURAARK presentation at DEDICATE final seminar, October 21st 2013, Michelle L...
DURAARK presentation at DEDICATE final seminar, October 21st 2013, Michelle L...DURAARK presentation at DEDICATE final seminar, October 21st 2013, Michelle L...
DURAARK presentation at DEDICATE final seminar, October 21st 2013, Michelle L...
 
DURAARK at IGeLU 2014
DURAARK at IGeLU 2014DURAARK at IGeLU 2014
DURAARK at IGeLU 2014
 
A Domain-driven Approach to Digital Curation and Preservation of 3D Architect...
A Domain-driven Approach to Digital Curation and Preservation of 3D Architect...A Domain-driven Approach to Digital Curation and Preservation of 3D Architect...
A Domain-driven Approach to Digital Curation and Preservation of 3D Architect...
 
Presentation of the DURAARK project at Ex Libris conference, Berlin, Germany.
Presentation of the DURAARK project at Ex Libris conference, Berlin, Germany.Presentation of the DURAARK project at Ex Libris conference, Berlin, Germany.
Presentation of the DURAARK project at Ex Libris conference, Berlin, Germany.
 
DURAARK at Bibliotheksymposium Wildau
DURAARK at Bibliotheksymposium WildauDURAARK at Bibliotheksymposium Wildau
DURAARK at Bibliotheksymposium Wildau
 
DURAARK presentation CIB W78 "Applications of IT in AEC" conference Beijing 2...
DURAARK presentation CIB W78 "Applications of IT in AEC" conference Beijing 2...DURAARK presentation CIB W78 "Applications of IT in AEC" conference Beijing 2...
DURAARK presentation CIB W78 "Applications of IT in AEC" conference Beijing 2...
 
Preservation of 3 d objects of buildings
Preservation of 3 d objects of buildingsPreservation of 3 d objects of buildings
Preservation of 3 d objects of buildings
 

Similar a A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles

Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...AKSHAY BHAGAT
 
Multivarite and network tools for biological data analysis
Multivarite and network tools for biological data analysisMultivarite and network tools for biological data analysis
Multivarite and network tools for biological data analysisDmitry Grapov
 
Qualitative and quantitative analysis
Qualitative and quantitative analysisQualitative and quantitative analysis
Qualitative and quantitative analysisNellie Deutsch (Ed.D)
 
Stacked Ensembles in H2O
Stacked Ensembles in H2OStacked Ensembles in H2O
Stacked Ensembles in H2OSri Ambati
 
PDQ: Proof-driven Querying presentation
PDQ: Proof-driven Querying presentationPDQ: Proof-driven Querying presentation
PDQ: Proof-driven Querying presentationDBOnto
 
Lec 1 integrating data science and data analytics in various research thrust
Lec 1 integrating data science and data analytics in various research thrustLec 1 integrating data science and data analytics in various research thrust
Lec 1 integrating data science and data analytics in various research thrustMenchita Falcutila Dumlao
 
1.2 Motivating Challenges As mentioned earlier, traditional data
1.2 Motivating Challenges As mentioned earlier, traditional data1.2 Motivating Challenges As mentioned earlier, traditional data
1.2 Motivating Challenges As mentioned earlier, traditional dataSantosConleyha
 
Mixed Methods Research Design
Mixed Methods Research DesignMixed Methods Research Design
Mixed Methods Research DesignSYIKIN MARIA
 
Metabolomic Data Analysis Workshop and Tutorials (2014)
Metabolomic Data Analysis Workshop and Tutorials (2014)Metabolomic Data Analysis Workshop and Tutorials (2014)
Metabolomic Data Analysis Workshop and Tutorials (2014)Dmitry Grapov
 
Data Quality
Data QualityData Quality
Data Qualityjerdeb
 
Synthese Recommender System
Synthese Recommender SystemSynthese Recommender System
Synthese Recommender SystemAndre Vellino
 
ELSS use cases and strategy
ELSS use cases and strategyELSS use cases and strategy
ELSS use cases and strategyAnton Yuryev
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-Systeminside-BigData.com
 
Handling Missing Attributes using Matrix Factorization 
Handling Missing Attributes using Matrix Factorization Handling Missing Attributes using Matrix Factorization 
Handling Missing Attributes using Matrix Factorization CS, NcState
 
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...alessio_ferrari
 
Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...Ken Karapetyan
 

Similar a A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles (20)

Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...
 
Multivarite and network tools for biological data analysis
Multivarite and network tools for biological data analysisMultivarite and network tools for biological data analysis
Multivarite and network tools for biological data analysis
 
Qualitative and quantitative analysis
Qualitative and quantitative analysisQualitative and quantitative analysis
Qualitative and quantitative analysis
 
Stacked Ensembles in H2O
Stacked Ensembles in H2OStacked Ensembles in H2O
Stacked Ensembles in H2O
 
PDQ: Proof-driven Querying presentation
PDQ: Proof-driven Querying presentationPDQ: Proof-driven Querying presentation
PDQ: Proof-driven Querying presentation
 
Lec 1 integrating data science and data analytics in various research thrust
Lec 1 integrating data science and data analytics in various research thrustLec 1 integrating data science and data analytics in various research thrust
Lec 1 integrating data science and data analytics in various research thrust
 
1.2 Motivating Challenges As mentioned earlier, traditional data
1.2 Motivating Challenges As mentioned earlier, traditional data1.2 Motivating Challenges As mentioned earlier, traditional data
1.2 Motivating Challenges As mentioned earlier, traditional data
 
Mixed Methods Research Design
Mixed Methods Research DesignMixed Methods Research Design
Mixed Methods Research Design
 
Mixed Methods Research Design
Mixed Methods Research DesignMixed Methods Research Design
Mixed Methods Research Design
 
Chapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data MiningChapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data Mining
 
Metabolomic Data Analysis Workshop and Tutorials (2014)
Metabolomic Data Analysis Workshop and Tutorials (2014)Metabolomic Data Analysis Workshop and Tutorials (2014)
Metabolomic Data Analysis Workshop and Tutorials (2014)
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Data Quality
Data QualityData Quality
Data Quality
 
Synthese Recommender System
Synthese Recommender SystemSynthese Recommender System
Synthese Recommender System
 
ELSS use cases and strategy
ELSS use cases and strategyELSS use cases and strategy
ELSS use cases and strategy
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-System
 
Handling Missing Attributes using Matrix Factorization 
Handling Missing Attributes using Matrix Factorization Handling Missing Attributes using Matrix Factorization 
Handling Missing Attributes using Matrix Factorization 
 
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
 
Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...
 
Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...
 

Más de Besnik Fetahu

Approaches for Improving and Enriching Textual Knowledge Bases
Approaches for Improving and Enriching Textual Knowledge BasesApproaches for Improving and Enriching Textual Knowledge Bases
Approaches for Improving and Enriching Textual Knowledge BasesBesnik Fetahu
 
Fine Grained Citation Span for References in WIkipedia
Fine Grained Citation Span for References in WIkipediaFine Grained Citation Span for References in WIkipedia
Fine Grained Citation Span for References in WIkipediaBesnik Fetahu
 
Finding News Citations For Wikipedia
Finding News Citations For WikipediaFinding News Citations For Wikipedia
Finding News Citations For WikipediaBesnik Fetahu
 
Improving Entity Retrieval on Structured Data
Improving Entity Retrieval on Structured DataImproving Entity Retrieval on Structured Data
Improving Entity Retrieval on Structured DataBesnik Fetahu
 
Automated News Suggestions for Populating Wikipedia Entity Pages
Automated News Suggestions for Populating Wikipedia Entity PagesAutomated News Suggestions for Populating Wikipedia Entity Pages
Automated News Suggestions for Populating Wikipedia Entity PagesBesnik Fetahu
 
How much is Wikipedia lagging behind News?
How much is Wikipedia lagging behind News?How much is Wikipedia lagging behind News?
How much is Wikipedia lagging behind News?Besnik Fetahu
 
euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)Besnik Fetahu
 
Complex Matching of RDF Datatype Properties
Complex Matching of RDF Datatype PropertiesComplex Matching of RDF Datatype Properties
Complex Matching of RDF Datatype PropertiesBesnik Fetahu
 
Summaries on the fly: Query-based Extraction of Structured Knowledge from Web...
Summaries on the fly: Query-based Extraction of Structured Knowledge from Web...Summaries on the fly: Query-based Extraction of Structured Knowledge from Web...
Summaries on the fly: Query-based Extraction of Structured Knowledge from Web...Besnik Fetahu
 
Towards Integration of Web Data into a coherent Educational Data Graph
Towards Integration of Web Data into a coherent Educational Data GraphTowards Integration of Web Data into a coherent Educational Data Graph
Towards Integration of Web Data into a coherent Educational Data GraphBesnik Fetahu
 
Combining a co-occurrence-based and a semantic measure for entity linking
Combining a co-occurrence-based and a semantic measure for entity linkingCombining a co-occurrence-based and a semantic measure for entity linking
Combining a co-occurrence-based and a semantic measure for entity linkingBesnik Fetahu
 

Más de Besnik Fetahu (11)

Approaches for Improving and Enriching Textual Knowledge Bases
Approaches for Improving and Enriching Textual Knowledge BasesApproaches for Improving and Enriching Textual Knowledge Bases
Approaches for Improving and Enriching Textual Knowledge Bases
 
Fine Grained Citation Span for References in WIkipedia
Fine Grained Citation Span for References in WIkipediaFine Grained Citation Span for References in WIkipedia
Fine Grained Citation Span for References in WIkipedia
 
Finding News Citations For Wikipedia
Finding News Citations For WikipediaFinding News Citations For Wikipedia
Finding News Citations For Wikipedia
 
Improving Entity Retrieval on Structured Data
Improving Entity Retrieval on Structured DataImproving Entity Retrieval on Structured Data
Improving Entity Retrieval on Structured Data
 
Automated News Suggestions for Populating Wikipedia Entity Pages
Automated News Suggestions for Populating Wikipedia Entity PagesAutomated News Suggestions for Populating Wikipedia Entity Pages
Automated News Suggestions for Populating Wikipedia Entity Pages
 
How much is Wikipedia lagging behind News?
How much is Wikipedia lagging behind News?How much is Wikipedia lagging behind News?
How much is Wikipedia lagging behind News?
 
euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)
 
Complex Matching of RDF Datatype Properties
Complex Matching of RDF Datatype PropertiesComplex Matching of RDF Datatype Properties
Complex Matching of RDF Datatype Properties
 
Summaries on the fly: Query-based Extraction of Structured Knowledge from Web...
Summaries on the fly: Query-based Extraction of Structured Knowledge from Web...Summaries on the fly: Query-based Extraction of Structured Knowledge from Web...
Summaries on the fly: Query-based Extraction of Structured Knowledge from Web...
 
Towards Integration of Web Data into a coherent Educational Data Graph
Towards Integration of Web Data into a coherent Educational Data GraphTowards Integration of Web Data into a coherent Educational Data Graph
Towards Integration of Web Data into a coherent Educational Data Graph
 
Combining a co-occurrence-based and a semantic measure for entity linking
Combining a co-occurrence-based and a semantic measure for entity linkingCombining a co-occurrence-based and a semantic measure for entity linking
Combining a co-occurrence-based and a semantic measure for entity linking
 

Último

GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSSLeenakshiTyagi
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINsankalpkumarsahoo174
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfSumit Kumar yadav
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 

Último (20)

GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 

A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles

  • 1. A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles
      Besnik Fetahu (1), Stefan Dietze (1), Bernardo Pereira Nunes (2), Marco Antonio Casanova (2), Davide Taibi (3), Wolfgang Nejdl (1)
      (1) L3S Research Center, Leibniz Universität Hannover
      (2) Department of Informatics, PUC-Rio
      (3) Institute for Educational Technologies, CNR
      May 29, 2014
  • 2. Outline
      1 Introduction
      2 Problem and Motivation
      3 Approach
          Resource Instance and Type Extraction
          Resource Sampling Approaches
          Constructing profiles: Dataset-topic graph
          Topic Ranking Approaches
      4 Experimental Setup
          Baselines
      5 Evaluation Results
          Efficiency of Dataset Profiling
          Scalability of Dataset Profiling
      6 Conclusions
  • 10. Introduction
      • Increasing amount of Web Data
      • Data heterogeneity: representation, language, quality and domains
      • Sparsely connected datasets
      • Lack of descriptive metadata about datasets
      • Exhaustive techniques for data analysis
      • Efficiency heavily dependent on information need
      • Ease of access and representation of datasets
  • 12. Why dataset profiling?
      • Growing number of datasets: 227 datasets
      • Data represented as triples: 31 billion triples
      • Multi-lingual content: 18 languages
      • Broad set of topics covered
      • Inter-dataset links

      Domain          # Datasets          Triples
      Media                   25    1,841,852,061
      Geographic              31    6,145,532,484
      Government              49   13,315,009,400
      Publications            87    2,950,720,693
      Cross-domain            41    4,184,635,715
      Life sciences           41    3,036,336,004
      User-generated          20      134,127,413
  • 15. Why dataset profiling?
      Find datasets covering the domain of “Renewable Energy”?
      • Sparsity: Which datasets cover the topic?
          • 38 out of 228 datasets contain topic coverage information.
      • Scalability: Use a SPARQL filter clause?
          • A regex(*) filter clause has to scan all triples to find those that contain a specific keyword.
      • Ambiguity: What are all the possible forms of renewable energy?
          • solar energy, wind energy, geothermal, ...
  • 21. Profiling Overview
      1. Metadata extraction
      2. Resource sampling
      3. Entity/topic extraction
      4. Profile graphs
      5. Profiles representation
  • 22. Dataset Profiling Example
  • 26. Resource Instance and Type Extraction
      • Simple SPARQL SELECT queries
      • Avg. indexing time: 7 min for a 10% sample vs. 4 hrs for 100% of resources
      • Approximately 300 million resource instances
      [Figure: log-scale indexing time per dataset, comparing 100% and 10% of resources]
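The slide mentions only "simple SPARQL SELECT queries" and does not show them. As a minimal sketch of what such queries might look like, the helpers below build one query that lists the types used in a dataset and one that pages through the instances of a given type. The function names, the paging parameters, and the exact query shapes are assumptions made for illustration, not the authors' queries.

```python
def type_query(limit=1000, offset=0):
    """Build a query listing the distinct rdf:type values used in a dataset.

    Hypothetical helper: the deck does not show the actual queries.
    """
    return (
        "SELECT DISTINCT ?type WHERE { ?resource a ?type } "
        f"LIMIT {limit} OFFSET {offset}"
    )


def instance_query(type_uri, limit=1000, offset=0):
    """Build a query listing resource instances of one type, one page at a time."""
    return (
        "SELECT DISTINCT ?resource WHERE { "
        f"?resource a <{type_uri}> }} "
        f"LIMIT {limit} OFFSET {offset}"
    )


# Example: second page of 10 instances of a placeholder type URI.
example = instance_query("http://example.org/T", limit=10, offset=10)
```

Paging with LIMIT and OFFSET keeps each request small, which matters when a public endpoint enforces result-size caps; whether the original pipeline paged this way is not stated.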
  • 29. Resource Sampling Approaches: Entity and Topic Extraction
      Resource Sampling
      • random: randomly select a resource instance for analysis
      • weighted: weigh a resource by the number of datatype properties used to describe it: $w_k = |f(r_k)| / \max\{|f(r_j)|\}$
      • centrality: weigh a resource by the number of types used to describe it: $c_k = |C_k| / |C|$
      Topic Extraction
      • Treat resources as documents by combining all textual literals
      • Perform NED [1] and extract corresponding DBpedia entities
      • Extract topics as DBpedia categories from entities via dcterms:subject
      [1] DBpedia Spotlight
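The weighted and centrality scores can be sketched in a few lines. The code below computes both scores from small in-memory dictionaries and then draws a sample in which higher-scored resources are more likely to be picked. The slide gives only the two formulas; using the scores as selection probabilities, sampling without replacement, and all names here are assumptions.

```python
import random


def weighted_scores(datatype_props):
    """w_k = |f(r_k)| / max |f(r_j)|, with f(r) the datatype properties of r."""
    max_count = max(len(props) for props in datatype_props.values()) or 1
    return {r: len(props) / max_count for r, props in datatype_props.items()}


def centrality_scores(resource_types, all_types):
    """c_k = |C_k| / |C|, with C_k the types of r_k and C all types in the dataset."""
    return {r: len(types) / len(all_types) for r, types in resource_types.items()}


def sample(scores, k, seed=0):
    """Draw up to k distinct resources, favouring higher scores (assumed usage)."""
    rng = random.Random(seed)
    pool = dict(scores)
    chosen = []
    while pool and len(chosen) < k:
        resources = list(pool)
        weights = [pool[r] for r in resources]
        if sum(weights) == 0:
            weights = None  # all remaining scores are zero: fall back to uniform
        pick = rng.choices(resources, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]
    return chosen


# Toy data: three resources with 4, 2 and 1 datatype properties.
props = {"r1": {"p1", "p2", "p3", "p4"}, "r2": {"p1", "p2"}, "r3": {"p1"}}
w = weighted_scores(props)
c = centrality_scores({"r1": {"A", "B"}, "r2": {"A"}}, {"A", "B", "C", "D"})
picked = sample(w, 2)
```

The random strategy from the slide corresponds to giving every resource the same score.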
  • 30. Constructing profiles: Dataset-topic graph
      1. Profile graph nodes: datasets, resources, topics
      2. Weighted graph edges: $\Delta\langle D, t\rangle$
      3. Edge weights: $\Delta\langle D_i, t\rangle \neq \Delta\langle D_j, t\rangle$
      4. Compute $\Delta\langle D_i, t\rangle$ by assessing the importance of t given the resources of $D_i$ as prior knowledge
      5. The given prior knowledge biases the importance of t in the profile graph towards $D_i$ [2]
      6. Incrementally add datasets to the profile graph by simply computing the weights $\Delta\langle D_k, t\rangle$
      [2] Scott White and Padhraic Smyth. 2003. Algorithms for estimating relative importance in networks. In 9th ACM SIGKDD (KDD ’03).
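To illustrate how prior knowledge can bias topic importance towards one dataset, the sketch below implements PageRank with priors in the form given by White and Smyth: at each step a node keeps a share (1 - beta) of the rank flowing in over its edges and receives beta times its prior probability. The toy graph, the choice of beta, the iteration count, and the idea of placing the prior mass on the resources of one dataset are illustrative assumptions; the slide does not give these details.

```python
def pagerank_with_priors(edges, priors, beta=0.3, iterations=50):
    """Rank nodes of a directed graph, biased towards the nodes in `priors`.

    edges  : list of (source, target) pairs
    priors : dict node -> non-negative prior weight (normalised internally)
    beta   : probability of jumping back to the prior nodes at each step
    """
    nodes = set(priors)
    for u, v in edges:
        nodes.update((u, v))
    out = {n: [] for n in nodes}
    for u, v in edges:
        out[u].append(v)

    total = sum(priors.values())
    prior = {n: priors.get(n, 0.0) / total for n in nodes}
    rank = dict(prior)

    for _ in range(iterations):
        nxt = {n: beta * prior[n] for n in nodes}
        for u in nodes:
            if out[u]:
                share = (1 - beta) * rank[u] / len(out[u])
                for v in out[u]:
                    nxt[v] += share
            else:
                # Dangling node: send its mass back to the prior distribution.
                for n in nodes:
                    nxt[n] += (1 - beta) * rank[u] * prior[n]
        rank = nxt
    return rank


# Toy graph: resources r1, r2 of one dataset linked to topics t1, t2
# (each undirected link is written as two directed edges).
edges = [("r1", "t1"), ("t1", "r1"), ("r2", "t1"),
         ("t1", "r2"), ("r2", "t2"), ("t2", "r2")]
rank = pagerank_with_priors(edges, priors={"r1": 1.0, "r2": 1.0})
```

In this toy graph t1 is reachable from both prior resources and t2 from only one, so t1 ends up with the higher score. Re-running with the resources of a different dataset as priors yields a different score for the same topic, which is the dataset-specific bias the slide describes.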
  • 31. Topic Ranking Approaches
      Topic filtering
      Topic pre-filtering: $\mathrm{NTR}(t, D) = \frac{\Phi(\cdot, D)}{\Phi(t, D)} + \frac{\Phi(\cdot, \cdot)}{\Phi(t, \cdot)}$
      • Filter noisy topics
      • $\Phi(t, D)$: number of entities associated with topic t in dataset D; a dot (·) stands for all topics or all datasets
      • Closely related to the tf-idf weighting scheme
      Topic Ranking
      • PageRank with Priors (PRankP)
      • HITS with Priors (HITSP)
      • K-Step Markov (KStepM)
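A small worked example may help make the NTR score concrete. The sketch below reads the formula as NTR(t, D) = Φ(·, D)/Φ(t, D) + Φ(·, ·)/Φ(t, ·), where Φ(t, D) counts entities linked to topic t in dataset D and a dot means "summed over all topics" or "summed over all datasets". That reading, the data layout, and the function name are assumptions; the slide also does not say which threshold or direction is used for filtering, so the code only computes the score.

```python
def ntr(topic, dataset, counts):
    """Compute NTR(t, D) from counts[(topic, dataset)] = number of linked entities.

    Assumption: the dot-aggregates are sums of these per-pair counts.
    """
    phi_t_d = counts.get((topic, dataset), 0)
    if phi_t_d == 0:
        raise ValueError("topic does not occur in this dataset")
    phi_dot_d = sum(n for (t, d), n in counts.items() if d == dataset)
    phi_t_dot = sum(n for (t, d), n in counts.items() if t == topic)
    phi_dot_dot = sum(counts.values())
    return phi_dot_d / phi_t_d + phi_dot_dot / phi_t_dot


# Toy counts: dataset A has 2 "Energy" and 8 "Music" entities, B has 10 "Energy".
counts = {("Energy", "A"): 2, ("Music", "A"): 8, ("Energy", "B"): 10}
energy_a = ntr("Energy", "A", counts)  # 10/2 + 20/12
music_a = ntr("Music", "A", counts)    # 10/8 + 20/8 = 3.75
```

As with tf-idf, the first term depends on how prominent the topic is inside the dataset and the second on how widespread it is across all datasets.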
  • 33. Experimental Setup
      Datasets and Ground-truth
      • 129 datasets from lod-cloud [3]
      • 6 ground-truth datasets with manually assigned topic indicators for their resources

      Dataset                  Properties                                                          #Resources
      yovisto                  skos:subject, dbp:{subject, class, discipline, kategorie, tagline}       62879
      oxpoints                 dcterms:subject, dc:subject                                              37258
      socialsemweb-thesaurus   skos:subject, tag:associatedTag, dcterms:subject                          2243
      semantic-web-dog-food    dcterms:subject, dc:subject                                              20145
      lak-dataset              dcterms:subject, dc:subject                                               1691

      Evaluation Metrics
      • NDCG@k (k = 1, ..., 1000)
      • Compare the ranking induced by the graphical models against the ideal ranking
      [3] At the time of experimentation only 129 dataset endpoints were responsive.
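For readers unfamiliar with NDCG@k, here is a minimal sketch of the metric. It uses the common formulation with a log2 position discount and takes the ideal ranking to be the same relevance values sorted in descending order. The slide does not say which DCG variant or relevance scale was used in the experiments, so both are assumptions.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items (position i is 0-based)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k):
    """DCG of the produced ranking divided by the DCG of the ideal ranking.

    Assumption: the ideal ranking is the same list sorted by relevance.
    """
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Relevance of the topics in the order a ranking approach returned them.
perfect = ndcg_at_k([3, 2, 1], k=3)   # already in ideal order
swapped = ndcg_at_k([0, 1], k=2)      # the relevant topic is ranked second
```

A ranking that matches the ideal order scores 1.0; pushing relevant topics further down lowers the score.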
  • 34. Baselines
      • tf-idf: Consider resources as documents. Extract for each dataset the top {50, 100, 150, 200} terms.
      • LDA: Consider datasets as documents [4]. Extract the top weighted topic terms. For every dataset, extract the top {50, 100, 150, 200} terms, with the number of topics set to {10, 20, 30, 40, 50}.
      [4] In this case it does not matter if datasets are considered at the resource level or aggregated.
  • 36. Efficiency of Dataset Profiling
      [Figure: Profiling accuracy for all topic ranking approaches. x-axis: NDCG rank (1 to 1000); y-axis: NDCG ranking score. Series: K-Step Markov + NTR, PageRank with priors + NTR, HITS with priors + NTR, LDA, tf-idf.]
      [Figure: K-Step Markov profiling accuracy (Centrality Sampling). x-axis: sample size (5 to 100). Series: KStepM + NTR, KStepM.]
  • 37. Scalability of Dataset Profiling
      [Figure: Time Performance vs. Profiling Accuracy. x-axis: sample size; y-axes: log-scale time performance and NDCG ranking score. Series: time and ranking for HITS with priors, K-Step Markov, and PageRank with priors.]
      • Sample sizes of 5% and 10% already provide stable profiling accuracy
      • Avg. 7 mins for indexing 10% of resources per dataset vs. 4 hrs per dataset for 100%
      • 2 mins for ranking dataset profiles with 10% of resources vs. 45 mins for 100%
      • NED runtime 10% vs. 100%?
  • 38. Motivation Example Revisited!
  • 39. Conclusions and Future Work
      • Structured dataset profiles
      • Scalable approach through sampling
      • Efficient profiling through topic filtering and ranking
      • Incremental generation of dataset profiles
      • Dataset profiles as a set of links (entity and topic links)
      • Provenance information of links (e.g. resources from which an entity is extracted)
      • Profiles for dataset recommendation, search, etc.
      Resources
      • Profiles Endpoint: http://data-observatory.org/lod-profiles/sparql
      • Profiles Webpage: http://data-observatory.org/lod-profiles/
  • 40. Thank you! Questions? #eswc2014Fetahu