Validating statistical index data represented in RDF using SPARQL queries
WESO Research Group
University of Oviedo, Spain
Jose Emilio Labra Gayo, Jose María Álvarez Rodríguez
Motivation
The WebIndex Project
Measure impact of the Web in different countries
First publication: 2012, 2 web sites:
Visualization: http://thewebindex.org
Data portal: http://data.webfoundation.org
WebIndex Workflow
[Workflow diagram with the labels: Raw Data, Excel, Computed data, Excel, External sources, All data, RDF Datastore, Visualizations]
Technical details
Index made from
61 countries (100 planned for 2013)
85 indicators:
51 Primary (questionnaires)
34 Secondary (external sources)
WebIndex RDF data
Modeled on top of RDF Data Cube
More than 1 million triples
Linked data: DBpedia, Organizations, etc.
Records data computation
WebIndex computation process (1)
Simplified with one indicator, 3 years and 4 countries
Raw Data
Country 2009 2010 2011
Spain 4 5 3
Finland 4 6
Armenia 1
Chile 6 8
Imputed Data
Country 2009 2010 2011
Spain 4 5 3
Finland 4 5 6
Armenia 1 1 1
Chile 6 8 10.6
Filtered Data (Indicator A)
Normalized Data (z-scores)
Country 2009 2010 2011
Spain -0.57 -0.57 -0.92
Finland -0.57 -0.57 -0.14
Chile 1.15 1.15 1.06
More details can be found here: http://thewebindex.org/about/methodology/computation/
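The normalization step can be sketched in a few lines of Python. This is a minimal sketch, not the WebIndex implementation: it assumes that Armenia was dropped by the filtering step (the z-score table has only three rows) and that the sample standard deviation (dividing by n - 1) is used, which is the choice that reproduces the numbers on the slide. The names `filtered`, `z_scores` and `normalize` are illustrative.

```python
from statistics import mean, stdev

# Imputed values for Indicator A, without Armenia (assumed to be filtered out).
filtered = {
    "Spain":   {2009: 4, 2010: 5, 2011: 3},
    "Finland": {2009: 4, 2010: 5, 2011: 6},
    "Chile":   {2009: 6, 2010: 8, 2011: 10.6},
}

def z_scores(values_by_country):
    """Return {country: z-score} for the values of a single year."""
    values = list(values_by_country.values())
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1); an assumption
    return {c: (v - mu) / sigma for c, v in values_by_country.items()}

def normalize(table):
    """Normalize every year (column) independently of the others."""
    years = sorted(next(iter(table.values())))
    result = {country: {} for country in table}
    for year in years:
        column = {country: row[year] for country, row in table.items()}
        for country, z in z_scores(column).items():
            result[country][year] = z
    return result

normalized = normalize(filtered)
for country, row in normalized.items():
    print(country, [round(row[year], 2) for year in sorted(row)])
```

The computed values agree with the z-score table to within about 0.01; the small differences come from how the slide shortens the decimals.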
WebIndex computation process (2)
Simplified with one indicator, 3 years and 4 countries
Normalized Data (z-scores)
Country 2009 2010 2011
Spain -0.57 -0.57 -0.92
Finland -0.57 -0.57 -0.14
Chile 1.15 1.15 1.06
Adjusted data
Country A B C D ...
Spain 8 7 9.1 7.1 ...
Finland 7 8 7.1 8 ...
Chile 8 9 7.6 6 ...
Group indicators
Country Readiness Impact Web Composite
Spain 5.7 3.5 5.1 4.5
Finland 5.5 3.9 7.1 4.9
Chile 6.7 4.5 7.6 5.1
Rankings
Country Readiness Impact Web Composite
Spain 2 3 3 3
Finland 3 2 2 2
Chile 1 1 1 1
Example of data representation in RDF
Example:
obs:obsM23 a qb:Observation ;
    cex:computation [
        a cex:Z-Score ;
        cex:observation obs:obsA23 ;
        cex:slice slice:sliceA09
    ] ;
    cex:value -0.57 ;
    cex:md5-checksum "2917835203..." ;
    cex:indicator indicator:A ;
    cex:concept country:ESP ;
    qb:dataSet dataset:A-Normalized ;
    # ... other declarations omitted for brevity
This declares that the value of the observation was obtained as the z-score of obs:obsA23 over slice:sliceA09.
Each observation follows the RDF Data Cube vocabulary, extended with metadata about how it was obtained.
Computex
Vocabulary for statistical computations
Extends RDF Data Cube
Ontology at: http://purl.org/weso/computex
Some terms:
cex:Concept
cex:Indicator
cex:Computation
cex:WeightSchema
qb:Observation
...
Validation process
Last year (2012)
Template based validation
MD5 checksum of each observation and some fields
This year (2013)
SPARQL based validation
3 levels of validation
RDF Data Cube
Shapes of data
Computation process
Ultimate goal: automatically compute the index
Validation approach
SPARQL CONSTRUCT queries instead of ASK
IF no error THEN empty model
ELSE RDF graph with error information
CONSTRUCT {
  [ a cex:Error ;
    cex:errorParam
      [ cex:name "obs" ; cex:value ?obs ] ,
      [ cex:name "value1" ; cex:value ?value1 ] ,
      [ cex:name "value2" ; cex:value ?value2 ] ;
    cex:msg "Observation has two different values" ] .
} WHERE {
  ?obs a qb:Observation .
  ?obs cex:value ?value1 .
  ?obs cex:value ?value2 .
  FILTER ( ?value1 != ?value2 )
}
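The difference between a yes/no answer and an error report can be illustrated outside SPARQL. The following is a minimal Python sketch under stated assumptions: triples are plain (subject, predicate, object) tuples, and `check_single_value` is a hypothetical helper, not part of Computex. Like the CONSTRUCT query, it returns an empty result when the data is valid and a description of each problem otherwise.

```python
def check_single_value(triples):
    """Return one error record per observation that carries two different values."""
    observations = {
        s for s, p, o in triples
        if p == "rdf:type" and o == "qb:Observation"
    }
    values = {}
    for s, p, o in triples:
        if p == "cex:value" and s in observations:
            values.setdefault(s, set()).add(o)

    errors = []
    for obs, vals in sorted(values.items()):
        if len(vals) > 1:
            errors.append({
                "type": "cex:Error",
                "msg": "Observation has two different values",
                "params": {"obs": obs, "values": sorted(vals)},
            })
    return errors

# Illustrative data: obs:o2 has two conflicting values.
graph = [
    ("obs:o1", "rdf:type", "qb:Observation"),
    ("obs:o1", "cex:value", 4),
    ("obs:o2", "rdf:type", "qb:Observation"),
    ("obs:o2", "cex:value", 5),
    ("obs:o2", "cex:value", 6),
]
errors = check_single_value(graph)
print(errors)
```

Unlike the query, which yields one solution per ordered pair of conflicting values, this sketch reports each offending observation once.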
SPARQL queries: RDF Data Cube
We converted the RDF Data Cube integrity constraints from ASK to CONSTRUCT queries.
Original ASK query:
ASK WHERE {
  ?dim a qb:DimensionProperty .
  FILTER NOT EXISTS { ?dim rdfs:range [] }
}
Converted CONSTRUCT query:
CONSTRUCT {
  [ a cex:Error ;
    cex:errorParam [ cex:name "dim" ; cex:value ?dim ] ;
    cex:msg "Every Dimension must have a declared range" ] .
} WHERE {
  ?dim a qb:DimensionProperty .
  FILTER NOT EXISTS { ?dim rdfs:range [] }
}
SPARQL expressivity
SPARQL can express complex validation patterns.
Example: Mean
CONSTRUCT {
  [ a cex:Error ;
    cex:errorParam # ...omitted
    cex:msg "Mean value does not match" ] .
} WHERE {
  ?obs a qb:Observation ;
       cex:computation ?comp ;
       cex:value ?val .
  ?comp a cex:Mean .
  { SELECT (AVG(?value) AS ?mean) ?comp WHERE {
      ?comp cex:observation ?obs1 .
      ?obs1 cex:value ?value .
    } GROUP BY ?comp }
  FILTER ( abs(?mean - ?val) > 0.0001 )
}
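The check that the Mean query performs can be restated as a short Python function. This is only a sketch of the same comparison: `check_mean` is a hypothetical name, the input values are made up for illustration, and the tolerance 0.0001 is the one used in the FILTER above.

```python
def check_mean(declared, inputs, tolerance=0.0001):
    """Return an error message if `declared` is not the mean of `inputs`, else None."""
    actual = sum(inputs) / len(inputs)
    if abs(actual - declared) > tolerance:
        return f"Mean value does not match: declared {declared}, computed {actual}"
    return None

print(check_mean(5.0, [4, 6]))   # valid observation, no error
print(check_mean(5.1, [4, 6]))   # declared value is off by 0.1
```

Comparing against a tolerance rather than testing for equality avoids false errors caused by floating-point rounding in the stored values.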
Limitations of SPARQL expressivity
Some built-in functions are not standardized
Example: the z-score uses the standard deviation, which requires a sqrt function
sqrt is not part of standard SPARQL; it is only available as an extension in some implementations
Ranking of values was not obvious (*)
2 solutions:
Using GROUP_CONCAT
Check a value against all the other values
Neither solution is efficient
(*) http://stackoverflow.com/questions/17313730/how-to-rank-values-in-sparql
SPARQL expressivity
CONSTRUCT { # ... omitted for brevity
} WHERE {
  ?obs cex:computation [ a cex:AverageGrowth ; cex:observations ?ls ] ;
       cex:value ?val .
  ?ls rdf:first [ cex:value ?v1 ] .
  { SELECT ( SUM(?v_n / ?v_n1) / COUNT(*) AS ?meanGrowth )
    WHERE {
      ?ls rdf:rest* [ rdf:first [ cex:value ?v_n ] ;
                      rdf:rest [ rdf:first [ cex:value ?v_n1 ] ] ] .
    }
  }
  FILTER ( abs(?meanGrowth * ?v1 - ?val) > 0.001 )
}
* This solution was provided by Joshua Taylor
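The arithmetic behind the AverageGrowth query can be sketched in Python. The function name is hypothetical, and the sketch assumes the RDF list is ordered so that its first element (the query's ?v1) is the most recent observation. Each pair of consecutive elements contributes the ratio ?v_n / ?v_n1, and the expected value is the mean ratio multiplied by the first element.

```python
def expected_by_average_growth(values):
    """Expected value from a list whose first element plays the role of ?v1.

    Every consecutive pair (v_n, v_n1) contributes the ratio v_n / v_n1.
    """
    ratios = [v_n / v_n1 for v_n, v_n1 in zip(values, values[1:])]
    mean_growth = sum(ratios) / len(ratios)
    return mean_growth * values[0]

# Illustrative input: 8 followed by 6, that is, growth of 8/6 per step.
estimate = expected_by_average_growth([8, 6])
print(round(estimate, 2))
```

With the inputs 8 and 6 the estimate is about 10.67, which is close to the 10.6 shown for Chile in 2011, although the slides do not state that this value was obtained this way.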
RDF Profiles
Descriptions (with constraints) of dataset families
Can be used to validate (and compute) RDF graphs
Examples:
Cube (Statistical data)
Normalization Update queries + Integrity constraints
Computex (statistical computations)
Adds more queries
Other user defined profiles
[Diagram with the labels: Cube, Computex, WebIndex, CloudIndex]
Cube Profile
W3C Candidate Recommendation (http://www.w3.org/TR/vocab-data-cube/)
1 base ontology + RDF Schema inference
2 update queries (normalization algorithm)
21 integrity constraint queries
cex:cube a cex:ValidationProfile ;
    cex:name "RDF Data Cube" ;
    cex:ontologyBase cube:cube.ttl ;
    cex:expandSteps (
        [ cex:name "Closure" ; cex:uri cube:closure.ru ]
        [ cex:name "Flatten" ; cex:uri cube:flatten.ru ]
    ) ;
    cex:integrityQuery
        [ cex:name "IC-1. Unique DataSet" ; cex:uri cube:q1.sparql ] ;
    # ...
    .
Computex Profile
1 base ontology
http://purl.org/weso/computex
Imports RDF Data Cube
cex:computex a cex:ValidationProfile ;
    cex:name "Computex" ;
    cex:ontologyBase cex:computex.ttl ;
    cex:import profile:cube.ttl ;
    cex:expandSteps (
        [ cex:name "Copy Raw" ; cex:uri cex:copyRaw.sparql ]
        # Other UPDATE queries
    ) ;
    cex:integrityQuery
        [ cex:name "Adjusted" ; cex:uri cex:adjusted.sparql ] ,
        # Other integrity queries
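How a profile with imports, expand steps and integrity queries might be processed can be sketched in Python. Everything here is an assumption made for illustration: the `Profile` class, the `validate` function and the order of evaluation (imported profiles first, then expand steps, then integrity checks) are not taken from the Computex source code, and graphs are modelled as plain sets of triples instead of RDF models.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Graph = set  # a set of (subject, predicate, object) tuples; illustrative only

@dataclass
class Profile:
    name: str
    expand_steps: List[Callable[[Graph], Graph]] = field(default_factory=list)
    integrity_checks: List[Callable[[Graph], List[str]]] = field(default_factory=list)
    imports: List["Profile"] = field(default_factory=list)

def validate(profile, graph):
    """Run imported profiles first, then this profile's expand steps and checks."""
    errors = []
    for imported in profile.imports:
        graph, imported_errors = validate(imported, graph)
        errors.extend(imported_errors)
    for step in profile.expand_steps:
        graph = step(graph)
    for check in profile.integrity_checks:
        errors.extend(check(graph))
    return graph, errors

def dimension_has_range(graph):
    """Stand-in for the integrity query 'Every Dimension must have a declared range'."""
    dims = {s for s, p, o in graph if p == "rdf:type" and o == "qb:DimensionProperty"}
    ranged = {s for s, p, o in graph if p == "rdfs:range"}
    return [f"Every Dimension must have a declared range: {d}" for d in sorted(dims - ranged)]

cube = Profile("RDF Data Cube", integrity_checks=[dimension_has_range])
computex = Profile("Computex", imports=[cube])

graph = {("ex:refArea", "rdf:type", "qb:DimensionProperty")}
_, errors = validate(computex, graph)
print(errors)
```

Validating against the Computex profile also reports the Cube error, because the imported profile's checks are applied as part of the importing one.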
Computex Validation tool
Available at:
http://computex.herokuapp.com
Features
Informative error messages
Option to generate EARL report
2 validation profiles (Cube and Computex)
Source code in Scala available at:
http://github.com/weso/computex
Based on profiles
Conclusions
WebIndex = use case for RDF Validation
SPARQL queries as an implementation approach
Challenges:
Expressivity limits of SPARQL
Complexity of some queries
Concept of RDF Profiles
Declarative, Turtle syntax
Intermediate level between OWL and SPARQL
END OF PRESENTATION
COMPLEMENTARY SLIDES
Limits of SPARQL expressivity
Ranking of values using GROUP_CONCAT
SELECT ?x ?v ?ranking {
  ?x :value ?v .
  { SELECT (GROUP_CONCAT(?x; separator="") AS ?ordered) {
      { SELECT ?x {
          ?x :value ?v .
        } ORDER BY DESC(?v)
      }
    }
  }
  BIND (str(?x) AS ?xName)
  BIND (strbefore(?ordered, ?xName) AS ?before)
  BIND ((strlen(?before) / strlen(?xName)) + 1 AS ?ranking)
} ORDER BY ?ranking
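The string trick used by the GROUP_CONCAT query is easier to follow in a general-purpose language. The Python sketch below mirrors it step by step; `rank_by_concatenation` is a hypothetical name and the scores are the Composite values from the group indicators table. The approach only works under two assumptions that the query shares: all names have the same length, and no name appears across the boundary between two neighbouring names in the concatenated string.

```python
def rank_by_concatenation(values):
    """Rank names by value (1 = highest) using the position inside one long string."""
    # Equivalent of GROUP_CONCAT over the names ordered by descending value.
    ordered = "".join(sorted(values, key=values.get, reverse=True))
    ranking = {}
    for name in values:
        before = ordered[: ordered.index(name)]      # strbefore(?ordered, ?xName)
        ranking[name] = len(before) // len(name) + 1
    return ranking

composite = {"ESP": 4.5, "FIN": 4.9, "CHL": 5.1}
ranking = rank_by_concatenation(composite)
print(ranking)
```

The ranking depends on string lengths rather than on the values themselves, which is one reason the slides describe this solution as inefficient and non-obvious.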
Limits of SPARQL expressivity
Ranking of values comparing all values
SELECT ?x ?v (COUNT(*) AS ?ranking) WHERE {
  ?x :value ?v .
  [] :value ?u .
  FILTER ( ?v <= ?u )
}
GROUP BY ?x ?v
ORDER BY ?ranking
* This solution was provided by Joshua Taylor
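The second ranking query counts, for every value, how many values are greater than or equal to it. A minimal Python sketch of the same idea follows; the function name is hypothetical and the scores are the Composite values from the group indicators table.

```python
def rank_by_comparison(values):
    """Rank = number of values u with v <= u, so the highest value gets rank 1."""
    return {
        name: sum(1 for u in values.values() if v <= u)
        for name, v in values.items()
    }

composite = {"Spain": 4.5, "Finland": 4.9, "Chile": 5.1}
ranks = rank_by_comparison(composite)
print(ranks)
```

Every value is compared against every other value, so the cost grows quadratically with the number of observations. Tied values share the same rank, equal to the number of values at or above them: for the values 5, 5 and 3 the ranks are 2, 2 and 3.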

08448380779 Call Girls In Friends Colony Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Validating statistical Index Data represented in RDF using SPARQL Queries: Computex

  • 1. Validating statistical index data represented in RDF using SPARQL queries
       Jose Emilio Labra Gayo, Jose María Álvarez Rodríguez
       WESO Research Group, University of Oviedo, Spain
  • 2. Motivation
       The WebIndex Project
         Measure impact of the Web in different countries
         First publication: 2012, 2 web sites:
           Visualization: http://thewebindex.org
           Data portal: http://data.webfoundation.org
  • 3. WebIndex Workflow
       Raw Data (Excel)
       Computed data (Excel)
       External sources
       All data (RDF Datastore)
       Visualizations
  • 4. Technical details
       Index made from
         61 countries (100 planned for 2013)
         85 indicators:
           51 Primary (questionnaires)
           34 Secondary (external sources)
       WebIndex RDF data
         Modeled on top of RDF Data Cube
         More than 1 million triples
         Linked data: DBPedia, Organizations, etc.
         Records data computation
  • 5. WebIndex computation process (1)
       Simplified with one indicator, 3 years and 4 countries
       Steps: Raw Data, Imputed Data, Filtered Data (Indicator A), Normalized Data (z-scores)

       Raw Data ("-" marks a missing value)
         Country   2009   2010   2011
         Spain     4      5      3
         Finland   4      -      6
         Armenia   1 (the only value given; its year is not recoverable here)
         Chile     6      8      -

       Imputed Data
         Country   2009   2010   2011
         Spain     4      5      3
         Finland   4      5      6
         Armenia   1      1      1
         Chile     6      8      10.6

       Normalized Data (z-scores)
         Country   2009    2010    2011
         Spain     -0.57   -0.57   -0.92
         Finland   -0.57   -0.57   -0.14
         Chile     1.15    1.15    1.06

       More details can be found here: http://thewebindex.org/about/methodology/computation/
  • 6. WebIndex computation process (2)
       Simplified with one indicator, 3 years and 4 countries

       Normalized Data (z-scores)
         Country   2009    2010    2011
         Spain     -0.57   -0.57   -0.92
         Finland   -0.57   -0.57   -0.14
         Chile     1.15    1.15    1.06

       Adjusted data
         Country   A   B   C     D     ...
         Spain     8   7   9.1   7.1   ...
         Finland   7   8   7.1   8     ...
         Chile     8   9   7.6   6     ...

       Group indicators
         Country   Readiness   Impact   Web   Composite
         Spain     5.7         3.5      5.1   4.5
         Finland   5.5         3.9      7.1   4.9
         Chile     6.7         4.5      7.6   5.1

       Rankings
         Country   Readiness   Impact   Web   Composite
         Spain     2           3        3     3
         Finland   3           2        2     2
         Chile     1           1        1     1

       More details can be found here: http://thewebindex.org/about/methodology/computation/
  • 7. Example of data representation in RDF
       Each observation follows the RDF Data Cube vocabulary, extended with metadata about how it was obtained.
       Example:

         obs:obsM23 a qb:Observation ;
           cex:computation [
             a cex:Z-Score ;
             cex:observation obs:obsA23 ;
             cex:slice slice:sliceA09 ;
           ] ;
           cex:value -0.57 ;
           cex:md5-checksum "2917835203..." ;
           cex:indicator indicator:A ;
           cex:concept country:ESP ;
           qb:dataSet dataset:A-Normalized ;
           # ... other declarations omitted for brevity

       It declares that the value of this observation was obtained as the z-score of obs:obsA23 over slice:sliceA09.
  • 8. Computex
       Vocabulary for statistical computations
       Extends RDF Data Cube
       Ontology at: http://purl.org/weso/computex
       Some terms: cex:Concept, cex:Indicator, cex:Computation, cex:WeightSchema, qb:Observation, ...
  • 9. Validation process
       Last year (2012)
         Template based validation
         MD5 checksum of each observation and some fields
       This year (2013)
         SPARQL based validation
         3 levels of validation:
           RDF Data Cube
           Shapes of data
           Computation process
       Ultimate goal: automatically compute the index
  • 10. Validation approach
        SPARQL CONSTRUCT queries instead of ASK
        IF no error THEN empty model ELSE RDF graph with error information

          CONSTRUCT {
            [ a cex:Error ;
              cex:errorParam [ cex:name "obs" ;    cex:value ?obs ] ,
                             [ cex:name "value1" ; cex:value ?value1 ] ,
                             [ cex:name "value2" ; cex:value ?value2 ] ;
              cex:msg "Observation has two different values"
            ] .
          }
          WHERE {
            ?obs a qb:Observation .
            ?obs cex:value ?value1 .
            ?obs cex:value ?value2 .
            FILTER ( ?value1 != ?value2 )
          }
  • 11. SPARQL queries: RDF Data Cube
        We converted the RDF Data Cube integrity constraints from ASK to CONSTRUCT queries.

        Original ASK query:

          ASK WHERE {
            ?dim a qb:DimensionProperty .
            FILTER NOT EXISTS { ?dim rdfs:range [] }
          }

        Converted CONSTRUCT query:

          CONSTRUCT {
            [ a cex:Error ;
              cex:errorParam [ cex:name "dim" ; cex:value ?dim ] ;
              cex:msg "Every Dimension must have a declared range"
            ] .
          }
          WHERE {
            ?dim a qb:DimensionProperty .
            FILTER NOT EXISTS { ?dim rdfs:range [] }
          }
  • 12. SPARQL expressivity
        SPARQL can express complex validation patterns.
        Example: Mean

          CONSTRUCT {
            [ a cex:Error ;
              cex:errorParam # ...omitted
              cex:msg "Mean value does not match" ] .
          }
          WHERE {
            ?obs a qb:Observation ;
                 cex:computation ?comp ;
                 cex:value ?val .
            ?comp a cex:Mean .
            { SELECT (AVG(?value) AS ?mean) ?comp
              WHERE {
                ?comp cex:observation ?obs1 .
                ?obs1 cex:value ?value .
              }
              GROUP BY ?comp
            }
            FILTER ( abs(?mean - ?val) > 0.0001 )
          }
  • 13. Limitations of SPARQL expressivity
        Some built-in functions are not standardized
          Example: z-score employs standard deviation
          It requires the built-in function sqrt, which is available only in some SPARQL implementations
        Ranking of values was not obvious (*)
          2 solutions:
            Using GROUP_CONCAT
            Checking a value against all the other values
          Neither solution is efficient
        (*) http://stackoverflow.com/questions/17313730/how-to-rank-values-in-sparql
  • 14. SPARQL expressivity

          CONSTRUCT {
            # ... omitted for brevity
          }
          WHERE {
            ?obs cex:computation [ a cex:AverageGrowth ; cex:observations ?ls ] ;
                 cex:value ?val .
            ?ls rdf:first [ cex:value ?v1 ] .
            # ?ls is projected and grouped so the subquery joins with the outer list
            { SELECT ?ls ( SUM(?v_n / ?v_n1) / COUNT(*) AS ?meanGrowth )
              WHERE {
                ?ls rdf:rest* [ rdf:first [ cex:value ?v_n ] ;
                                rdf:rest  [ rdf:first [ cex:value ?v_n1 ] ] ] .
              }
              GROUP BY ?ls
            }
            FILTER ( abs(?meanGrowth * ?v1 - ?val) > 0.001 )
          }

        * This solution was provided by Joshua Taylor
  • 15. RDF Profiles
        Descriptions (with constraints) of dataset families
        Can be used to validate (and compute) RDF graphs
        Examples:
          Cube (statistical data)
            Normalization update queries + integrity constraints
          Computex (statistical computations)
            Adds more queries
          Other user-defined profiles
        Profiles in the diagram: Cube, Computex, WebIndex, CloudIndex
  • 16. Cube Profile
        W3C Candidate Recommendation (http://www.w3.org/TR/vocab-data-cube/)
        1 base ontology + RDF Schema inference
        2 update queries (normalization algorithm)
        21 integrity constraint queries

          cex:cube a cex:ValidationProfile ;
            cex:name "RDF Data Cube" ;
            cex:ontologyBase cube:cube.ttl ;
            cex:expandSteps (
              [ cex:name "Closure" ; cex:uri cube:closure.ru ]
              [ cex:name "Flatten" ; cex:uri cube:flatten.ru ]
            ) ;
            cex:integrityQuery
              [ cex:name "IC-1. Unique DataSet" ; cex:uri cube:q1.sparql ] ;
            # ...
            .
  • 17. Computex Profile
        1 base ontology: http://purl.org/weso/computex
        Imports RDF Data Cube

          cex:computex a cex:ValidationProfile ;
            cex:name "Computex" ;
            cex:ontologyBase cex:computex.ttl ;
            cex:import profile:cube.ttl ;
            cex:expandSteps (
              [ cex:name "Copy Raw" ; cex:uri cex:copyRaw.sparql ]
              # Other UPDATE queries
            ) ;
            cex:integrityQuery
              [ cex:name "Adjusted" ; cex:uri cex:adjusted.sparql ] ,
              # Other integrity queries
  • 18. Computex Validation tool
        Available at: http://computex.herokuapp.com
        Features
          Informative error messages
          Option to generate EARL report
          Based on profiles
          2 validation profiles (Cube and Computex)
        Source code in Scala available at: http://github.com/weso/computex
  • 20. Conclusions
        WebIndex = use case for RDF validation
        SPARQL queries as an implementation approach
        Challenges:
          Expressivity limits of SPARQL
          Complexity of some queries
        Concept of RDF Profiles
          Declarative, Turtle syntax
          Intermediate level between OWL and SPARQL
  • 23. Limits of SPARQL expressivity
        Ranking of values using GROUP_CONCAT

          SELECT ?x ?v ?ranking {
            ?x :value ?v .
            { SELECT (GROUP_CONCAT(?x ; separator="") AS ?ordered) {
                { SELECT ?x {
                    ?x :value ?v .
                  } ORDER BY DESC(?v)
                }
              }
            }
            BIND (str(?x) AS ?xName)
            BIND (strbefore(?ordered, ?xName) AS ?before)
            BIND ((strlen(?before) / strlen(?xName)) + 1 AS ?ranking)
          }
          ORDER BY ?ranking
  • 24. Limits of SPARQL expressivity
        Ranking of values comparing all values

          SELECT ?x ?v (COUNT(*) AS ?ranking)
          WHERE {
            ?x :value ?v .
            [] :value ?u .
            FILTER ( ?v <= ?u )
          }
          GROUP BY ?x ?v
          ORDER BY ?ranking

        * This solution was provided by Joshua Taylor
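
The comparison-based ranking in slide 24 counts, for each value, how many values are greater than or equal to it. As a rough cross-check outside SPARQL, the following Python sketch applies the same counting idea to the Readiness scores from slide 6, and recomputes the z-scores of the simplified indicator from slide 5. The function and constant names are hypothetical and are not part of Computex. Using the sample standard deviation (n - 1) is an assumption; it appears to reproduce the slide values to within two-decimal rounding.

```python
from statistics import mean, stdev

# Simplified data copied from the slides (one indicator, Armenia already filtered out).
FILTERED = {
    "Spain": [4, 5, 3],
    "Finland": [4, 5, 6],
    "Chile": [6, 8, 10.6],
}
READINESS = {"Spain": 5.7, "Finland": 5.5, "Chile": 6.7}


def z_scores(by_country, year_index):
    """Z-score of each country's value for one year (column) of the table.

    Assumption: the sample standard deviation (n - 1) is used.
    """
    values = [row[year_index] for row in by_country.values()]
    m, s = mean(values), stdev(values)
    return {c: (row[year_index] - m) / s for c, row in by_country.items()}


def rank_by_comparison(scores):
    """Rank 1 is the highest value: count every u with v <= u, as the query does."""
    return {c: sum(1 for u in scores.values() if v <= u) for c, v in scores.items()}


if __name__ == "__main__":
    print(z_scores(FILTERED, 0))          # z-scores for 2009
    print(rank_by_comparison(READINESS))  # ranking of the Readiness scores
```

Like the SPARQL query, this compares every value against every other value, so the cost grows quadratically with the number of observations, which is in line with the efficiency concern raised in slide 13.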