The document presents a proposed schedule for developing and evaluating Rendezvous, a middleware for storing massive RDF graphs in NoSQL databases. Key aspects of Rendezvous include a workload-aware partitioning approach, mapping RDF data to different NoSQL data models, and a caching structure to accelerate query response. The evaluation compares Rendezvous to existing approaches using LUBM, a standard RDF benchmark, and finds that Rendezvous outperforms alternatives with graph-aware partitioning and near caching. Future work is planned to improve Rendezvous with compression, updates, additional NoSQL support, and more complex workloads.
A Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL Databases
1. A Workload-Aware Middleware for Storing
Massive RDF Graphs into NoSQL Databases
PhD Qualifying Exam
Luiz Henrique Zambom Santana
Prof. Dr. Ronaldo dos Santos Mello
Advisor
UFSC/CTC/INE/PPGCC
2. Agenda
● Introduction: Motivation, objectives, and contributions
● Background
○ RDF
○ NoSQL
● State of the Art
○ Open Issues
● Rendezvous
○ Storing: Fragmentation, Indexing, Partitioning, and Mapping
○ Querying: Query decomposition and Caching
● Evaluation
● Schedule
3. Introduction: Motivation
● RDF is currently widespread:
○ Best Buy:
■ http://www.nytimes.com/external/readwriteweb/2010/07/01/01readwriteweb-how-best-buy-is-using-the-semantic-web-23031.html
○ Globo.com:
■ https://www.slideshare.net/icaromedeiros/apresantacao-ufrj-icaro2013
○ US data.gov:
■ https://www.data.gov/developers/semantic-web
5. Introduction: Objectives
This PhD thesis proposal presents Rendezvous, a
middleware for storing massive RDF graphs. This
middleware includes a novel data partitioning approach, a
fragmentation strategy that maps pieces of the RDF graph
into NoSQL databases with different data models, and a
caching structure that accelerates query response.
6. Introduction: Contributions
● (i) a mapping of RDF data to the columnar, document, and key/value NoSQL
data models (SADALAGE; FOWLER, 2012)
● (ii) a workload-aware partitioner based on the current graph structure and,
mainly, on the typical application workload
● (iii) a caching schema based on key/value databases for speeding up the
query response time
● (iv) an experimental evaluation that compares the current version of our
approach against two baselines (Rainbow (GU; HU; HUANG, 2015) and
ScalaRDF (HU et al., )) by considering Redis, Apache Cassandra and
MongoDB, the most popular key/value, columnar and document NoSQL
databases, respectively
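Contribution (i) can be illustrated with a minimal sketch of how the same triples might be laid out under the three data models. The layouts, function names, and sample triples below are assumptions made for illustration; they are not the actual mapping used by Rendezvous.

```python
from collections import defaultdict

# Hypothetical sample triples (subject, predicate, object); not from the thesis.
TRIPLES = [
    ("ex:alice", "ex:worksFor", "ex:ufsc"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:bob", "ex:worksFor", "ex:ufsc"),
]

def to_key_value(triples):
    """Key/value layout (assumed): key 'subject|predicate' -> list of objects."""
    store = defaultdict(list)
    for s, p, o in triples:
        store[f"{s}|{p}"].append(o)
    return dict(store)

def to_document(triples):
    """Document layout (assumed): one document per subject, predicates as fields."""
    docs = {}
    for s, p, o in triples:
        doc = docs.setdefault(s, {"_id": s})
        doc.setdefault(p, []).append(o)
    return list(docs.values())

def to_columnar(triples):
    """Wide-column layout (assumed): row key = subject, column name = predicate."""
    rows = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        rows[s][p].append(o)
    return {s: dict(cols) for s, cols in rows.items()}

kv = to_key_value(TRIPLES)
docs = to_document(TRIPLES)
cols = to_columnar(TRIPLES)
```

In this sketch the key/value layout favors lookups by an exact subject and predicate, while the document and wide-column layouts keep all predicates of a subject together, which suits subject-centered (star-shaped) access.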
11. State of the Art - Non-NoSQL Triplestores
WARP (h-hop replication), YARS, Hexastore
(multiple indexes), 4store, SPIDER, RDF-3X,
SHARD, SW-Store (vertical partition), SOLID,
SPOVC (horizontal partition), and S2X
13. State of the Art - NoSQL Triplestores
RDFJoin, RDFKB, Jena+HBase, Hive+HBase, CumulusRDF,
Rya, Stratustore, MAPSIN, H2RDF, AMADA, Trinity.RDF,
H2RDF+, MonetDBRDF, xR2RML, W3C RDF/JSON,
Rainbow, Sempala, PrestoRDF, RDFChain, Tomaszuk,
Bouhali and Laurent, Papailiou et al., and ScalaRDF.
14. State of the Art - Categories
● RDF/NoSQL Converters
● Polystores/Multimodel
● In-memory
Rainbow (GU; HU; HUANG, 2015)
AMADA (Aranda-Andújar, 2012)
15. State of the Art
● BUGIOTTI, F. et al. Invisible glue: scalable self-tuning multi-stores. In:
Conference on Innovative Data Systems Research (CIDR). [S.l.: s.n.], 2015.
16. State of the Art - Open Issues
● To avoid indexing all the triple component permutations
● To consider workload and the usage of statistics for data partitioning
● To exploit in-memory possibilities
● To combine RDF storage with multiple NoSQL models
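The first open issue can be made concrete with a small sketch. A store that indexes every permutation of subject (S), predicate (P), and object (O) keeps six orderings, whereas a given workload may need only a few of them. The selection rule below (bound components first, variables last) and the sample workload are simplifying assumptions for illustration, not the indexing scheme of Rendezvous.

```python
from itertools import permutations

def all_orderings():
    """Every permutation of the three triple components (six in total)."""
    return ["".join(p) for p in permutations("SPO")]

def needed_orderings(patterns):
    """For each triple pattern (None marks a variable), choose one ordering
    that places the bound components first, so a prefix lookup can answer it."""
    needed = set()
    for pattern in patterns:
        bound = [name for name, value in zip("SPO", pattern) if value is not None]
        free = [name for name, value in zip("SPO", pattern) if value is None]
        needed.add("".join(bound + free))
    return sorted(needed)

# Hypothetical workload of triple patterns.
WORKLOAD = [
    (None, "ex:worksFor", "ex:ufsc"),   # predicate and object bound
    ("ex:alice", None, None),           # subject bound
    ("ex:alice", "ex:name", None),      # subject and predicate bound
]

full = all_orderings()
needed = needed_orderings(WORKLOAD)  # ['POS', 'SPO']
```

For this workload two orderings cover all three patterns, so four of the six indexes would be maintained without ever being used.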
38. Evaluation
● LUBM: an ontology for the university domain, synthetic RDF data scalable to any
size, and 14 extensional queries representing a variety of properties
● Generated dataset with 4000 universities (around 100 GB, containing
around 500 million triples)
● 12 queries with joins: all of them have at least one subject-subject join, and
six of them also have at least one subject-object join
● Software: Apache Jena 3.2.0 with Java 1.8, Redis 3.2, MongoDB
3.4.3, and Apache Cassandra 3.10
● Hardware: Amazon m3.xlarge spot instance with 7.5 GB of memory and 1 x 32 GB SSD storage
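The join types mentioned above can be sketched as follows: two triple patterns form a subject-subject join when they share a variable in the subject position, and a subject-object join when the subject variable of one is the object variable of the other. The classifier and the sample query are illustrative assumptions; the query is only loosely modeled on LUBM's vocabulary and is not one of the 14 benchmark queries.

```python
from itertools import combinations

def is_var(term):
    """SPARQL-style variables start with '?'."""
    return term.startswith("?")

def join_types(patterns):
    """Classify the joins between every pair of (subject, predicate, object) patterns."""
    kinds = set()
    for (s1, _, o1), (s2, _, o2) in combinations(patterns, 2):
        if is_var(s1) and s1 == s2:
            kinds.add("subject-subject")
        if (is_var(s1) and s1 == o2) or (is_var(o1) and o1 == s2):
            kinds.add("subject-object")
        if is_var(o1) and o1 == o2:
            kinds.add("object-object")
    return kinds

# Hypothetical query: graduate students and the parent organization of their department.
QUERY = [
    ("?x", "rdf:type", "ub:GraduateStudent"),
    ("?x", "ub:memberOf", "?z"),
    ("?z", "ub:subOrganizationOf", "?y"),
]

kinds = join_types(QUERY)
```

Here the first two patterns share `?x` as subject (a star shape), and `?z` links the object of the second pattern to the subject of the third (a chain shape), so the query contains both join types.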
41. Evaluation: Conclusions
● Fragments are scalable
● Larger boundaries do not necessarily imply a larger
storage size
● Graph-aware partitions are better than NoSQL partitions
● Near cache is fast, but it makes data consistency harder
to maintain
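The near-cache trade-off can be illustrated with a minimal sketch: each client keeps a local copy of results in front of a shared store, so reads are fast, but a copy can become stale when another client writes. The classes below are hypothetical stand-ins written for this example; they do not reflect the caching code of Rendezvous or the API of any specific database.

```python
class RemoteStore:
    """Stand-in for a shared key/value database reached over the network."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


class NearCacheClient:
    """Client that keeps a local (near) cache in front of the remote store."""
    def __init__(self, remote):
        self.remote = remote
        self.local = {}

    def get(self, key):
        if key in self.local:          # fast path: no network round trip
            return self.local[key]
        value = self.remote.get(key)   # slow path: fetch and remember
        self.local[key] = value
        return value

    def put(self, key, value):
        self.remote.put(key, value)
        self.local[key] = value        # only this client's copy is refreshed

    def invalidate(self, key):
        self.local.pop(key, None)


store = RemoteStore()
a, b = NearCacheClient(store), NearCacheClient(store)

a.put("q1", "result-v1")
first = b.get("q1")      # b caches "result-v1" locally
a.put("q1", "result-v2")
stale = b.get("q1")      # b still sees "result-v1"
b.invalidate("q1")
fresh = b.get("q1")      # b now sees "result-v2"
```

Keeping such copies consistent requires some invalidation or expiry mechanism, which is the extra difficulty noted in the conclusion above.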
42. Evaluation: Future Work
● Compression of triples during storage
● Update and delete operations
● Other NoSQL types (e.g., graph)
● Better datasets
44. Schedule
● Middleware development (continuously until 2018)
○ Compression
○ Graph database
○ More complex and abstract workload awareness
● Submission of papers (continuously until 2018)
○ Special Interest Group On Management of Data (SIGMOD)
○ Very Large Data Bases (VLDB)
○ IEEE Transactions on Knowledge and Data Engineering (TKDE)
● Defense of the PhD thesis (2019)