This document discusses enabling the Resource Description Framework (RDF) within the Hadoop ecosystem. It describes existing open source projects like Apache Jena Hadoop RDF Tools and GraphBuilder that aim to fill gaps in representing RDF concepts and processing RDF data with Hadoop. The document also provides an overview of the components in Jena Hadoop RDF Tools, sample code, and encourages users to test the tools and provide feedback to the developer community.
Quadrupling your elephants - RDF and the Hadoop ecosystem
1. 1
RDF and the Hadoop Ecosystem
Rob Vesse
Twitter: @RobVesse
Email: rvesse@apache.org
2. 2
Software Engineer at YarcData (part of Cray Inc)
Working on big data analytics products
Active open source contributor primarily to RDF & SPARQL related projects
Apache Jena Committer and PMC Member
dotNetRDF Lead Developer
Primarily interested in RDF, SPARQL and Big Data Analytics technologies
3. 3
What's missing in the Hadoop ecosystem?
What's needed to fill the gap?
What's already available?
Jena Hadoop RDF Tools
GraphBuilder
Other Projects
Getting Involved
Questions
5. 5
Apache, the projects and their logos shown here are registered trademarks or trademarks of The Apache Software Foundation in the U.S. and/or other countries
6. 6
No first class projects
Some limited support in other projects
E.g. Giraph supports RDF by bridging through the TinkerPop stack
Some existing external projects
Lots of academic proof of concepts
Some open source efforts but mostly task specific
E.g. Infovore targeted at creating curated Freebase and DBpedia datasets
8. 8
Need to efficiently represent RDF concepts as Writable types
Nodes, Triples, Quads, Graphs, Datasets, Query Results etc.
What's the minimum viable subset?
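To make the Writable requirement concrete, here is a minimal sketch of the pattern. The class and field names are illustrative, not the actual Jena Hadoop RDF classes; a stand-in interface mirrors Hadoop's `org.apache.hadoop.io.Writable` contract (exactly the `write`/`readFields` pair) so the sketch compiles without Hadoop on the classpath.

```java
import java.io.*;

// Stand-in for org.apache.hadoop.io.Writable: Hadoop's contract is
// exactly these two methods for serializing to/from the wire.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Illustrative triple writable (hypothetical class, not Jena's
// TripleWritable): stores the three RDF terms as lexical strings.
class SimpleTripleWritable implements Writable {
    String subject, predicate, object;

    SimpleTripleWritable() {}                      // needed for readFields use
    SimpleTripleWritable(String s, String p, String o) {
        subject = s; predicate = p; object = o;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(subject);
        out.writeUTF(predicate);
        out.writeUTF(object);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        subject = in.readUTF();
        predicate = in.readUTF();
        object = in.readUTF();
    }
}
```

The real implementations use RDF Thrift rather than `writeUTF`, but the shape of the contract is the same.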
9. 9
Need to be able to get data in and out of RDF formats
Without this we can't use the power of the Hadoop ecosystem to do useful work
Lots of serializations out there:
RDF/XML
Turtle
NTriples
NQuads
JSON-LD
etc
Also would like to be able to produce end results as RDF
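Line-based serializations like NTriples are the easiest to handle on Hadoop because each line is an independent triple. The toy parser below illustrates that property; it covers only IRI subjects/predicates and IRI or plain-literal objects, whereas real parsers (e.g. Jena's RIOT) handle the full grammar.

```java
import java.util.regex.*;

// Toy NTriples line parser (illustrative only): each line of an NTriples
// file is one self-contained triple terminated by " .", which is what
// makes the format amenable to splitting into independent records.
class NTriplesLine {
    private static final Pattern TRIPLE = Pattern.compile(
        "^<([^>]*)>\\s+<([^>]*)>\\s+(<[^>]*>|\"[^\"]*\")\\s*\\.\\s*$");

    // Returns {subject, predicate, object} or null if the line does not
    // match this deliberately simplified grammar.
    static String[] parse(String line) {
        Matcher m = TRIPLE.matcher(line);
        if (!m.matches()) return null;
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }
}
```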
10. 10
Map/Reduce building blocks
Common operations e.g. splitting
Enable developers to focus on their applications
User Friendly tooling
i.e. non-programmer tools
13. 13
Set of modules part of the Apache Jena project
Originally developed at Cray and donated to the project earlier this year
Experimental modules on the hadoop-rdf branch of our source repository
Currently only available as development SNAPSHOT releases
Group ID: org.apache.jena
Artifact IDs:
jena-hadoop-rdf-common
jena-hadoop-rdf-io
jena-hadoop-rdf-mapreduce
Latest Version: 0.9.0-SNAPSHOT
Aims to fulfill all the basic requirements for enabling RDF on Hadoop
Built against Hadoop Map/Reduce 2.x APIs
14. 14
Provides the Writable types for RDF primitives
NodeWritable
TripleWritable
QuadWritable
NodeTupleWritable
All backed by RDF Thrift
A compact binary serialization for RDF using Apache Thrift
See http://afs.github.io/rdf-thrift/
Extremely efficient to serialize and deserialize
Allows for efficient WritableComparator implementations that perform binary comparisons
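The point of those binary comparisons is that when two values' serialized forms sort in the same order as the values themselves, the shuffle/sort phase can order keys by raw bytes and never deserialize them. The sketch below shows the unsigned, lexicographic byte comparison this relies on (Hadoop's `WritableComparator.compareBytes` does essentially this); it is an illustration of the idea, not Jena's implementation.

```java
// Sketch of the raw-byte comparison underlying an efficient
// WritableComparator: compare two serialized byte ranges without
// deserializing the values they encode.
class RawBytes {
    static int compareBytes(byte[] a, int aOff, int aLen,
                            byte[] b, int bOff, int bLen) {
        int n = Math.min(aLen, bLen);
        for (int i = 0; i < n; i++) {
            int x = a[aOff + i] & 0xFF; // treat bytes as unsigned
            int y = b[bOff + i] & 0xFF;
            if (x != y) return x - y;
        }
        return aLen - bLen; // on a shared prefix, the shorter run sorts first
    }
}
```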
15. 15
Provides InputFormat and OutputFormat implementations
Supports most formats that Jena supports
Designed to be extendable with new formats
Will split and parallelize inputs where the RDF serialization is amenable to this
Also transparently handles compressed inputs and outputs
Note that compression blocks splitting
i.e. trade off between IO and parallelism
16. 16
Various reusable building block Mapper and Reducer implementations:
Counting
Filtering
Grouping
Splitting
Transforming
Can be used as-is to do some basic Hadoop tasks or used as
building blocks for more complex tasks
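As a conceptual example of the counting building block, here is the "node count" computation (the RDF analogue of word count): the map step emits a (node, 1) pair for each term of every triple and the reduce step sums the counts per node. This is a plain-Java simulation of that logic, not the actual Mapper/Reducer classes from the project.

```java
import java.util.*;

// Simulated node count: in a real job the "map" loop would be a Mapper
// emitting (NodeWritable, 1) pairs and the merge would happen in a
// summing Reducer; a HashMap stands in for the shuffle + reduce here.
class NodeCount {
    static Map<String, Integer> count(List<String[]> triples) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] t : triples) {       // map: one triple in...
            for (String node : t) {        // ...three (node, 1) pairs out
                counts.merge(node, 1, Integer::sum); // reduce: sum per node
            }
        }
        return counts;
    }
}
```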
18. 18
For NTriples inputs, compared the performance of a text-based node count versus an RDF-based node count
Performance as good (within 10%) and sometimes significantly better
Heavily dataset dependent
Varies considerably with cluster setup
Also depends on how the input is processed
YMMV!
For other RDF formats you would struggle to implement this at all
19. 19
Originally developed by Intel
Some contributions by Cray - awaiting merging at time of writing
Open source under Apache License
https://github.com/01org/graphbuilder/tree/2.0.alpha
2.0.alpha is the Pig based branch
Allows data to be transformed into graphs using Pig scripts
Provides set of Pig UDFs for translating data to graph formats
Supports both property graphs and RDF graphs
21. 21
Uses a declarative mapping based on Pig primitives
Maps and Tuples
Have to be explicitly joined to the data because Pig UDFs can only be called with String arguments
Has some benefits e.g. conditional mappings
RDF Mappings operate on Property Graphs
Requires original data to be mapped to a property graph first
Direct mapping to RDF is a future enhancement that has yet to be implemented
23. 23
Infovore - Paul Houle
https://github.com/paulhoule/infovore/wiki
Cleaned and curated Freebase datasets processed with Hadoop
CumulusRDF - Institute of Applied Informatics and Formal Description Methods
https://code.google.com/p/cumulusrdf/
RDF store backed by Apache Cassandra
24. 24
Please start playing with these projects
Please interact with the community:
dev@jena.apache.org
What works?
What is broken?
What is missing?
Contribute
Apache projects are ultimately driven by the community
If there's a feature you want please suggest it
Or better still contribute it yourself!
27. 27
> bin/hadoop jar jena-hadoop-rdf-stats-0.9.0-SNAPSHOT-hadoop-job.jar \
    org.apache.jena.hadoop.rdf.stats.RdfStats --node-count \
    --output /user/output --input-type triples /user/input
--node-count requests the Node Count statistics be calculated
Assumes mixed quads and triples input if no --input-type is specified
Using this for triples-only data can skew statistics
e.g. can result in high node counts for the default graph node
Hence we explicitly specify the input as triples
33. 33
> ./pig -x local examples/property_graphs_and_rdf.pig
> cat /tmp/rdf_triples/part-m-00000
Running in local mode for this demo
Output goes to /tmp/rdf_triples
Tons of active projects
Accumulo, Ambari, Avro, Cassandra, Chukwa, Giraph, Hama, HBase, Hive, Mahout, Pig, Spark, Tez, ZooKeeper
And those are just off the top of my head (and ignoring Incubating projects)
However mostly focused on traditional data sources
e.g. logs, relational databases, unstructured data
Highlight benefit of WritableComparator - significant speed up in reduce phase
Project also provides a demo JAR which shows how to use the building blocks to perform common Hadoop tasks on RDF
So Node Count is essentially the Word Count "Hello World" of Hadoop programming
Mention that Intel may not have yet merged our pull request that adds the declarative mapping approach
~20 lines of code (less if you remove unnecessary formatting)