Speaker: Ron Buckley (OCLC)
OCLC has been working over the last year to move its massive repository to HBase. This talk will focus on the impetus behind the move, implementation details and technology choices we've made (key design, shredding PDFs and other digital objects into HBase, scaling), and the value-add that HBase brings to digital collection management.
Digital Library Collection Management using HBase
1. The world’s libraries. Connected.
Digital Library Collection
Management using HBase
“AKA: A Success Story”
Case Studies
Ron Buckley
HBaseCon
May 5, 2014
2. The world’s libraries. Connected.
About OCLC
Worldwide, member-owned library cooperative
• Based in Dublin, Ohio
• Founded in 1967
• Not-for-profit
WorldCat
• Union catalog of library items from 72,000 libraries in 170 countries
• Over 2 billion records, 2.5 billion location listings
Hosting
• Melvyl, University of California Digital Library (and many others) are hosted directly out of WorldCat
3. The world’s libraries. Connected.
Center of our world
• 15-month project to rebuild data infrastructure with Hadoop at the center.
• Leveraged HBase to build multiple new products.
• Replaced and decommissioned multiple Oracle RAC environments.
Old Meets New
• Dewey Decimal System – OCLC owns and maintains the Dewey Decimal System, which is stored and maintained in HBase.
HBase @ OCLC
4. The world’s libraries. Connected.
Why
• Data set was too big a long time ago – Not long after we built our Oracle
database we removed almost all joins and views.
• Too expensive – Making a dataset available for free open-access was going to
cost us almost $1 Million, just for storage
• Slow – Couldn’t analyze data set because it took a week just to walk it.
How
• Text index and our own secondary indexing for HBase
• Transition period of about 12 months with both systems running – multiple tools built and run to find and fix discrepancies.
Moving from Relational to HBase
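The slides do not describe how the discrepancy tools worked. As a rough sketch of the general idea only, one could compare per-record checksums exported from each store and report what is missing or different. The class name, the enum values, and the assumption that each store can produce a key-to-checksum map are all hypothetical.

```java
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

/** Illustrative only: compares key-to-checksum maps from two stores. Not OCLC's tooling. */
final class StoreComparison {
    enum Discrepancy { MISSING_IN_HBASE, MISSING_IN_RELATIONAL, CONTENT_DIFFERS }

    /** Returns one entry per record key that is absent from one side or whose checksums differ. */
    static Map<String, Discrepancy> compare(Map<String, String> relationalChecksums,
                                            Map<String, String> hbaseChecksums) {
        Map<String, Discrepancy> report = new TreeMap<>();
        for (Map.Entry<String, String> entry : relationalChecksums.entrySet()) {
            String key = entry.getKey();
            if (!hbaseChecksums.containsKey(key)) {
                report.put(key, Discrepancy.MISSING_IN_HBASE);
            } else if (!Objects.equals(entry.getValue(), hbaseChecksums.get(key))) {
                report.put(key, Discrepancy.CONTENT_DIFFERS);
            }
        }
        for (String key : hbaseChecksums.keySet()) {
            if (!relationalChecksums.containsKey(key)) {
                report.put(key, Discrepancy.MISSING_IN_RELATIONAL);
            }
        }
        return report;
    }
}
```

At the scale described in the talk, a comparison like this would more plausibly run as a distributed job than over in-memory maps; the maps here only keep the sketch short.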
6. The world’s libraries. Connected.
HBase - Hub of Linked Data
It is imperative that library data be available in new data
formats that are native to the web.
• Databases are walked and analyzed frequently
• Many hundreds of millions, soon billions, of interrelated
endpoints are stored back to HBase.
• Endpoints are made available through multiple standard
protocols (RDF, JSON, Turtle, N-Triples) for machine use.
- Tim Berners-Lee
7. The world’s libraries. Connected.
HBase - Hub of Linked Data
http://experiment.worldcat.org/entity/work/data/1151002411.html
8. The world’s libraries. Connected.
“Libraries aren’t just about books”
• OCLC CONTENTdm is used by thousands of libraries to manage the preservation of local digital content.
• We’re moving over 40 million digital objects (many TBs) into a centrally hosted HBase repository.
HBase as Content Store
9. The world’s libraries. Connected.
• Key – Internal key is MD5-hashed into the HBase key.
• PDFs – Compression (Snappy) doesn’t reduce the size of PDF documents.
• 10 MB cell size – Objects over 10 MB are not stored in HBase; we store them in HDFS. (We do store metadata rows for these objects in HBase.)
Digital storage in HBase
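A minimal sketch of the two rules on this slide, using only the JDK: hash the internal key with MD5 to form the row key, and route objects over 10 MB to HDFS. The class and method names are hypothetical, and two details are assumptions the slide does not settle: that the row key is the hex form of the digest (rather than the raw 16 bytes), and that "10 MB" means 10 × 1024 × 1024 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative only: names and the hex-encoded key format are assumptions, not OCLC's code. */
final class ObjectPlacement {
    /** Assumed threshold: objects larger than this go to HDFS, with a metadata row in HBase. */
    static final long MAX_CELL_BYTES = 10L * 1024 * 1024;

    enum Store { HBASE_CELL, HDFS_WITH_METADATA_ROW }

    /** Hashes the internal key with MD5 and returns the 32-character lowercase hex digest. */
    static String rowKey(String internalKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(internalKey.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Decides where an object's bytes should live based on its size. */
    static Store placementFor(long objectSizeBytes) {
        return objectSizeBytes > MAX_CELL_BYTES ? Store.HDFS_WITH_METADATA_ROW : Store.HBASE_CELL;
    }
}
```

Hashing a sequential internal key is a common way to spread writes evenly across regions instead of concentrating them on one; the slide does not state the motivation, so treat that as a likely reason rather than a confirmed one.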
17. The world’s libraries. Connected.
• We run hundreds of M/R jobs a day on our user-facing cluster.
• Our cluster is oversized for HBase
• M/R jobs run with limited tasks, niced,…
• Still faster than “the old way”
• Looking forward to multi-tenant features in upcoming releases
M/R and HBase?
18. The world’s libraries. Connected.
- We needed a way to upgrade HBase without downtime.
- Rolling installs on a 50-node cluster sounded cumbersome.
Upgrading HBase
19. The world’s libraries. Connected.
• HBase master-master replication is used to maintain an always-available disaster site.
• We have a middle-tier service layer (like the Thrift server) that knows about both our main cluster and our DR cluster.
• When we shut down the main cluster, the middle tier automatically switches to the disaster site.
• Each cluster runs a web server that exposes its Hadoop config.
• Example: http://HBase-config-perf.ent.oclc.org:9007/HBaseconf/HBase-site.xml
Replication for 0 downtime install
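The switching behaviour of the middle tier can be sketched in a few lines. This is not OCLC's service layer: the class name, the generic cluster handle `C`, and the "retry once on the DR cluster when the main cluster throws" policy are all assumptions made for illustration.

```java
import java.util.function.Function;

/** Illustrative only: a router that knows a main and a DR cluster. All names are hypothetical. */
final class FailoverRouter<C> {
    private final C main;
    private final C disasterRecovery;

    FailoverRouter(C main, C disasterRecovery) {
        this.main = main;
        this.disasterRecovery = disasterRecovery;
    }

    /** Runs the operation against the main cluster; if that throws, retries once on the DR cluster. */
    <R> R execute(Function<C, R> operation) {
        try {
            return operation.apply(main);
        } catch (RuntimeException mainFailure) {
            // Main cluster unavailable (for example, shut down for an upgrade): use the DR site.
            return operation.apply(disasterRecovery);
        }
    }
}
```

A production middle tier would more likely track cluster health and switch once rather than paying for a failed call on every request. Per-call fallback is shown only because it is the shortest correct form of the idea, and it relies on replication keeping the DR cluster current.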
20. The world’s libraries. Connected.
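• Instead of relying on hbase-site.xml in the classpath, we load the cluster's published hbase-site.xml via Configuration.addResource.
import java.net.MalformedURLException;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
// Constructor excerpt: tableCounter, BlockingCounter and LOG belong to the enclosing class, which the slide does not show.
public HBaseManagedConnection(String hbaseSiteUrl, int maxPoolSize) {
    tableCounter = new BlockingCounter(maxPoolSize);
    Configuration config = HBaseConfiguration.create();
    try {
        // Layer the remote hbase-site.xml over the defaults loaded by create().
        config.addResource(new URL(hbaseSiteUrl));
    } catch (MalformedURLException mue) {
        LOG.error("**** URL to HBase Site is invalid, Unable to connect to HBase: {} *****", hbaseSiteUrl);
    }
    // The remainder of the constructor is not shown on the slide.
}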
Replication for 0 downtime install
21. The world’s libraries. Connected.
Summary
• HBase is the center of our world and, by association, that of a lot of libraries.
• You can move from relational to HBase.
• We’ve been successful running user-facing traffic alongside Map/Reduce.
• EASY to support. We have two converted Oracle DBAs as our front-line admins. Mostly, they’re lent to MySQL support for other internal systems.