Data model for analysis of scholarly documents in the MapReduce paradigm
1. Data model for analysis of scholarly documents in the
MapReduce paradigm
Adam Kawa Lukasz Bolikowski Artur Czeczko Piotr Jan Dendek
Dominika Tkaczyk
Centre for Open Science (CeON), ICM UW
Warsaw, July 6, 2012
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 1 / 19
2. Agenda
1 Problem definition
2 Requirements specification
3 Exemplary solutions based on Apache Hadoop Ecosystem tools
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 2 / 19
3. The data that we are in possession
Vast collections of scholarly documents to store
10 million of full texts
(PDF, plain text)
17 million of document metadata records
(described in XML-based BWMeta format)
4TB of data
(10TB including data archives)
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 3 / 19
4. The tasks that we are doing
Big knowledge to extract and discover
17 million of document metadata records (XML)
contain title, subtitles, abstract, keywords, references, contributors and
their affiliations, publishing magazine, . . .
input for many state-of-the-art machine learning algorithms
relatively simple ones: searching documents with given title, finding
scientific teams, . . .
quite complex ones: author name disambiguation, bibliometrics,
classification code assignment, . . .
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 4 / 19
5. The requirements that we have specified
Multiple demands regarding storage and processing of large amounts of data:
scalability and parallelism — easily handle tens of terabytes of data and
parallelize the computation effectively
flexible data model — possibility to add or update data, and enhance it
content by implicit information discovered by our algorithms
latency requirements — support batch offline processing as well as
random, realtime read/write requests
availability of many clients — accessible by programmers and researchers
with diverse language preferences and expertise
reliablility and cost-effectiveness — ideally an open-source software which
does not require expensive hardware
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 5 / 19
6. Document-related data as linked data
Information about document-related resources can be simply described a directed
labeled graph
entities (e.g. documents, contributors, references) are nodes in the graph
relationships between entities are directed labeled edges in the graph
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 6 / 19
7. Linked graph as a collection of RDF triples
A directed labeled graph can be simply represented a collection of RDF triples
a triple consists of subject, predicate and object
a triple represents a statement which denotes that a resource (subject) holds
a value (object) for some attribute (predicate) of that resource
a triple can represent any statements about any resource
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 7 / 19
8. Hadoop as a solution for scalability/performance issues
Apache Hadoop is most commonly used open-source solution for storing and
processing big data in reliable, high-performance and cost-effective way.
Scalable storage
Parallel processing
Subprojects and many Hadoop-related projects
HDFS — distributed file system that provides high-throughput access
to large data
MapReduce — framework for distributed processing of large data sets
(Java and e.g. JavaScript, Python, Perl, Ruby via Streaming)
HBase — scalable, distributed data store with flexible schema, random
read/write access and fast scans
Pig/Hive — higher-level abstractions on top of MapReduce (simple
data manipulation languages)
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 8 / 19
9. Apache Hadoop Ecosystem tools as RDF triple stores
SHARD [3] — a Hadoop backed RDF triple store
stores triples in flat files in HDFS
data cannot be modified randomly
less efficient for queries that requires the inspection of only a small
number of triples
PigSPARQL [6] — translates SPARQL queries to Pig Latin programs and
runs them on Hadoop cluster
stores RDF triples with the same predicate in separate, flat files in
HDFS
H2RDF [5] — a RDF store that combines MapReduce with HBase
stores triples in HBase using three flat-wide tables
Jena-HBase [4] — a HBase backed RDF triple store
provides six different pluggable HBase storage layouts
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 9 / 19
10. HBase as storage layer for RDF triples
Storing RDF triples in Apache HBase has several advantages
flexible data model — columns can be dynamically added and removed;
multiple versions of data in a particular cell; data serialized to a byte array
random read and write — more suitable for semi-structured RDF data
than HDFS where files cannot be modified randomly and usually whole file
must be read sequentially to find subset of records
availability of many clients
interactive clients — native Java API, REST or Apache Thrift
batch clients — MapReduce (Java), Pig (PigLatin) and Hive
(HiveQL)
automatically sorted records — quick lookups and partial scans; joins as
fast (linear) merge-joins
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 10 / 19
11. Exemplary HBase schema — Flat-wide layout
Advantages
no prior knowledge about data is required
colocation of all information about a resource within a single row
support of multi-valued properties
support of reified statements (statements about statements)
Disadvantages
unlimited number of columns
increase of storage space
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 11 / 19
12. Exemplary HBase schema - Vertically Partitioned layout [1]
Advantages
support of multi-valued properties
support of reified statements (statements about statements)
storage space savings when compared to the previous layout
first-step (predicate-bound) pairwise joins as fast merge-joins
Disadvantages
increased number of joins
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 12 / 19
13. Exemplary HBase schema - Hexastore layout [2]
Advantages
support of multi-valued properties
support of reified statements (statements about statements)
first-step pairwise joins as fast merge-joins
Disadvantages
increased of number of joins
increase of storage space
complication of an update operation
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 13 / 19
14. HBase schema - other layout
Some derivative and hybrid layouts exist to combine the advantages of original
layouts
a combination of the vertically partitioned and the hexastore layout [4]
a combination of the flat-wide and the vertically partitioned layouts [4]
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 14 / 19
15. Challenges
a large number of join operations
relatively expensive
and practically cannot be avoided (at least for more complex queries)
but specialized join techniques can be used e.g. multi join, merge-sort
join, replicated join, skewed join
lack of a native support for cross-row atomicity (e.g. in the form of
transactions)
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 15 / 19
16. Possible performance optimization techniques
property tables — properties often queried together are stored in the same
record for a quick access [8, 9]
materialized path expressions — precalculation and materialization of
the most commonly used paths through an RDF graph in advance [1, 2]
graph-oriented partitioning scheme [7]
take advantage of the spatial locality inherent in graph pattern
matching
higher replication of data that is on the border of any particular
partition (however, problematic for a graph that is modified)
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 16 / 19
17. The ways of processing data from HBase
Many various tools are integrated with HBase and can read data from and write
data to HBase tables
Java MapReduce
possibility to use our legacy Java code in map and reduce methods
delivers better perfromance than Apache Pig
Apache Pig
provides common data operations (e.g. filters, unions, joins, ordering)
and nested types (e.g. tuples, bags, maps)
supports multiple specialized joins implementation
possibility to run MapReduce jobs directly from PigLatin scripts
can be embeded in Python code
Interactive clients (e.g. Java API, REST or Apache Thrift)
interactive access to relatively small subset of our data by sending API
calls on demand e.g. a web-based client
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 17 / 19
18. Case study: author name disambiguation algorithm
The most complex algorithm that we have run over Apache HBase so far is
author name disambiguation algorithm.
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 18 / 19
19. Thanks!
More information about CeON:
http://ceon.pl/en/research
c 2012 Adam Kawa. Ten dokument jest dostepny na licencji Creative Commons Uznanie autorstwa 3.0 Polska
Tre´´ licencji dostepna pod adresem: http://creativecommons.org/licenses/by/3.0/pl/
sc
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 19 / 19
20. D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. Scalable
Semantic Web Data Management using vertical partitioning. In VLDB, pages
411–422, 2007.
C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple Indexing for
Semantic Web Data Management. In VLDB, pages 1008-1019, 2008.
K. Rohloff and R. Schantz. High-performance, massively scalable distributed
systems using the mapreduce software framework: The shard triple-store.
International Workshop on Programming Support Innovations for Emerging
Distributed Applications, 2010.
V. Khadilkar, M. Kantarcioglu, P. Castagna, and B. Thuraisingham.
Jena-HBase: A Distributed, Scalable and Efffient RDF Triple Store. Technical
report, 2012. http://www.utdallas.edu/ vvk072000/Research/Jena-HBase-
Ext/tech-report.pdf
N. Papailiou, I. Konstantinou, D. Tsoumakos and N. Koziris. H2RDF:
Adaptive Query Processing on RDF Data in the Cloud. In Proceedings of the
21th International Conference on World Wide Web (WWW demo track),
Lyon, France, 2012.
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 19 / 19
21. A. Sch¨tzle, M. Przyjaciel-Zablocki and G. Lausen: PigSPARQL: Mapping
a
SPARQL to Pig Latin. 3th International Workshop on Semantic Web
Information Management (SWIM 2011), in conjunction with the 2011 ACM
International Conference on Management of Data (SIGMOD 2011). Athens
(Greece).
J. Huang, D. Abadi and K. Ren. Scalable SPARQL Querying of Large RDF
Graphs. VLDB Endowment, Volume 4 (VLDB 2011).
K. Wilkinson, C. Sayers, H. Kuno, and D. Reynolds. Efficient RDF Storage
and Retrieval in Jena2. In SWDB, pages 131–150.
K. Wilkinson. Jena property table implementation. In SSWS, 2006.
(CeON ICM UW) Apache Hadoop in CeON ICM UW Warsaw, July 6 2012 19 / 19