Fostering Serendipity through Big Linked Data
1. Fostering Serendipity through Big Linked Data
Muhammad Saleem, Maulik R. Kamdar, Aftab Iqbal, Shanmukha Sampath, Helena F. Deus, and Axel-Cyrille Ngonga Ngomo
Semantic Web Challenge at ISWC2013, October 21-25, 2013, Sydney, Australia
2. Agenda
• Motivation
• Datasets
• Architecture
• Evaluation
• Requirements
• Demo
• Conclusion and Future Work
4. Triplification: Linked TCGA
• TCGA is a publicly accessible atlas of cancer-related data from the National Cancer Institute (NCI)
– 9000 patients
– 33 cancer types
– 147,645 raw data files
– 12.7 TB
• This is only 46% of the total expected data, with new data being submitted every day
• The goal is to enable cancer researchers to make and validate important discoveries
• Total Linked TCGA > 30 billion triples (largest dataset in LOD)
5. Triplification: PubMed
• Collection of publications from the biomedical domain
• Large amount of metadata (MeSH terms)
• 23+ million publications
• 10,000 new publications/month
6. Big Data Continuous Integration
[Figure: architecture diagram. Labelled components: TopFed (Parser, Federator, Optimizer, Integrator, Index); SPARQL Query, Sub-query, Results; PubMed Entrez Utilities; TCGA Data Portal; RDFizer; Auto Loader; three SPARQL endpoints, each over RDF.]
7. Exon-Expression
[Figure: source selection tree, labelled "Highly Scalable". Recoverable content follows.]

For each query triple t(s, p, o) ∈ T

Predicate and class sets:
A = {chromosome, result, bcr_patient_barcode}
G = {start, stop}
D = {seg_mean, rpmmm, scaled_est, p_exp_val}
E = {RPKM}
M = {beta_value, position}
C = {CNV, SNP, E-Gene, E-Protein, miRNA, Clinical}
F = {Expression-Exon}
B = {DNA-Methylation}

Conditions:
C-1 = {{p ∈ {D ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ C}} ∧ {{S-Join(p, D ∪ C) ∨ P-Join(p, D ∪ C)} ∨ {!S-Join(p, M ∪ B ∪ E ∪ F) ∧ !P-Join(p, M ∪ B ∪ E ∪ F)}}}
C-2 = {{p ∈ {E ∪ A ∪ G} ∨ {p = rdf:type ∧ o ∈ F}} ∧ {{S-Join(p, E ∪ F) ∨ P-Join(p, E ∪ F)} ∨ {!S-Join(p, M ∪ B ∪ D ∪ C) ∧ !P-Join(p, M ∪ B ∪ D ∪ C)}}}
C-3 = {{p ∈ {M ∪ A} ∨ {p = rdf:type ∧ o ∈ B}} ∧ {{S-Join(p, M ∪ B) ∨ P-Join(p, M ∪ B)} ∨ {!S-Join(p, E ∪ F ∪ D ∪ C) ∧ !P-Join(p, E ∪ F ∪ D ∪ C)}}}

Branches:
• C-1 ∨ Category Colour = blue: (CNV, SNP, E-Gene, miRNA, E-Protein, Clinical)
• C-2 ∨ Category Colour = pink: Exon-Expression
• C-3 ∨ Category Colour = green: Methylation

Tumour lookup:
IF tumour lookup is successful, forward to the corresponding leaf
ELSE broadcast to every one

Tumours per leaf SPARQL endpoint:
• blue: b1 1-16, b2 17-33
• pink: p1 1-5, p2 6-11, p3 12-16, p4 17-22, p5 23-27, p6 28-33
• green: g1 1-4, g2 5-8, g3 9-12, g4 13-16, g5 17-20, g6 21-24, g7 25-27, g8 28-30, g9 31-33
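The routing on this slide can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the TopFed implementation: it checks only the first half of each condition (predicate membership, or rdf:type with a class in the matching set) and omits the S-Join and P-Join tests, which the slide does not define. The function names and the pairing of leaf labels (b1, p1, g1, ...) with tumour ranges in listed order are assumptions made for illustration.

```python
from typing import List, Optional

RDF_TYPE = "rdf:type"

# Predicate and class sets as given on the slide.
A = {"chromosome", "result", "bcr_patient_barcode"}
G = {"start", "stop"}
D = {"seg_mean", "rpmmm", "scaled_est", "p_exp_val"}
E = {"RPKM"}
M = {"beta_value", "position"}
C = {"CNV", "SNP", "E-Gene", "E-Protein", "miRNA", "Clinical"}
F = {"Expression-Exon"}
B = {"DNA-Methylation"}

# (colour, admissible predicates, admissible classes); join checks omitted.
RULES = [
    ("blue", D | A | G, C),
    ("pink", E | A | G, F),
    ("green", M | A, B),
]

# Assumed pairing of leaf endpoints with inclusive tumour ranges.
LEAVES = {
    "blue": [("b1", 1, 16), ("b2", 17, 33)],
    "pink": [("p1", 1, 5), ("p2", 6, 11), ("p3", 12, 16),
             ("p4", 17, 22), ("p5", 23, 27), ("p6", 28, 33)],
    "green": [("g1", 1, 4), ("g2", 5, 8), ("g3", 9, 12), ("g4", 13, 16),
              ("g5", 17, 20), ("g6", 21, 24), ("g7", 25, 27),
              ("g8", 28, 30), ("g9", 31, 33)],
}


def candidate_colours(p: str, o: Optional[str]) -> List[str]:
    """Return the branches whose simplified condition holds for (p, o)."""
    return [
        colour
        for colour, predicates, classes in RULES
        if p in predicates or (p == RDF_TYPE and o in classes)
    ]


def select_endpoints(p: str, o: Optional[str],
                     tumour: Optional[int] = None) -> List[str]:
    """Pick leaf endpoints for one triple pattern.

    tumour=None models a failed tumour lookup: the pattern is
    broadcast to every leaf of each matching branch.
    """
    selected = []
    for colour in candidate_colours(p, o):
        for name, low, high in LEAVES[colour]:
            if tumour is None or low <= tumour <= high:
                selected.append(name)
    return selected
```

For example, `select_endpoints("RPKM", None, tumour=7)` selects a single pink leaf, while a predicate from A such as `chromosome` matches all three branches, which is where the omitted join conditions would normally narrow the choice.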
8. Evaluation: Number of Sub-Queries Submitted
[Figure: bar chart of the number of sub-queries submitted by FedX and TopFed for queries 1-10 and on average; y-axis from 0 to 60.]
• TopFed submits about one third as many sub-queries as FedX
• Number of ASK requests
– FedX 480
– TopFed 10
9. Evaluation: Query Runtime
[Figure: query execution time (msec, log scale from 1 to 100000) of FedX and TopFed for queries 1-10 and on average.]
• TopFed outperforms FedX significantly on 90% of the queries
• On average, the query runtime of TopFed is about one third of that of FedX
• TopFed's best runtimes (query 2, query 3) are more than 75 times smaller than those of FedX
10. Big Data Track Requirements
• Data Volume
– 7.36 billion triples from Linked TCGA
– 23 million publications from PubMed
• Data Variety
– The Linked TCGA data was extracted from raw text files of different structures
– The metadata associated with PubMed publications was processed and transformed into RDF
– Unstructured data (publication abstracts) is processed to extract
mentions of gene names and cancers
• Data Velocity
– TCGA data doubles every 2 months
– PubMed grows by about 10,000 publications per month
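To make the velocity figures concrete, the following Python sketch projects both datasets forward. It assumes, purely for illustration, that the stated growth rates stay constant and that the triple count grows in step with the raw TCGA data; the function names are hypothetical and the starting values are the figures quoted on this slide.

```python
def projected_size(current: float, months: float,
                   doubling_months: float = 2.0) -> float:
    """Exponential growth: the size doubles every `doubling_months`."""
    return current * 2 ** (months / doubling_months)


def projected_publications(current: int, months: int,
                           per_month: int = 10_000) -> int:
    """Linear growth: a fixed number of new publications per month."""
    return current + per_month * months


if __name__ == "__main__":
    triples_now = 7.36e9          # Linked TCGA triples quoted on the slide
    publications_now = 23_000_000  # PubMed publications quoted on the slide
    for months in (2, 6, 12):
        print(months,
              f"{projected_size(triples_now, months):.3e}",
              projected_publications(publications_now, months))
```

The contrast is the point of the sketch: under these assumptions the TCGA side grows exponentially while PubMed grows linearly, so the loading pipeline for TCGA is the one that has to scale.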