WorldCat is the world’s largest network of library content and services. Over 25,000 libraries in 170 countries have cooperated for 40 years to build WorldCat. OCLC is currently transitioning WorldCat from Oracle to Apache HBase. This session will discuss our data design for representing the constantly changing ownership information for thousands of libraries (billions of data points, millions of daily updates) and our plans for managing HBase in an environment that is equal parts end-user-facing and batch.
OCLC migrates WorldCat data to Apache HBase
1. Apache HBase at OCLC
Ron Buckley
May 22, 2012
ron_buckley@oclc.org
2. About OCLC
OCLC delivers single-search-box access to more than 943 million items
from your library and the world's library collections. You'll find:
1.8 billion ownership indications (library holdings information)
214+ million books in libraries worldwide
663+ million articles with one-click access to full text
28+ million digital items from trusted sources like Google Books, OAIster
and HathiTrust
13+ million eBooks from leading aggregators and publishers
44+ million pieces of evaluative content (Tables of Contents, cover art,
summaries, etc.) included at no additional charge
And a LOT more (interlibrary loan facilitation, API access, library-centric
research)
3. Main Case for OCLC
• Library gets a new book.
• Librarian needs to enter all the data about that item into their local
system.
• It takes quite some time to correctly enter cataloging data into a local
system.
• Thousands of libraries are all going to get the same book and do the
same things, thereby replicating each other's work.
• There should be a system whereby libraries can share and build on each
other's work.
• SaaS before buzzwords were cool. System proposed in July 1966. First
use in 1971.
*A member of the HBase implementation team also worked on the initial
OCLC system.
4. Current Data State at OCLC
• Oracle (WorldCat – Oracle RAC)
• SAN Storage (Approximately 20 TB)
• Several other smaller instances of Oracle
• A LOT of stored procedures for read and update. The most
commonly used are 10 years old and difficult to follow (being
polite)
• Two copies of the primary database in other formats, various
processes to keep them in sync (or not)
5. Schema Design – Oracle Version
4 Main Tables; the Primary Key (xwcmd_id) is an ever-increasing, OCLC-
assigned number for every library resource.
6. Schema Design – HBase Version
4 Tables become 1
Use Columns as data
7. Using column qualifiers to represent library
ownership
hbase(main):001:0> get 'Worldcat','1'
data:createDate value=19690526 00:00:00.000
data:hold:10810 value={"md":[{"CDATE":"20080410 15:38:45.000"},{"CPID":"NA"},{"UDATE":"20080411 15:05:28.000"},{"UPID":"NA"}]}
data:hold:1100 value={"md":[{"CDATE":"20040826 02:08:57.000"},{"CPID":"NA"},{"UDATE":"20040826 02:08:57.000"},{"UPID":"NA"}]}
Qualifier Value
data:hold:10810 {"md":[{"CDATE":"20080410 15:38:45.000"},{"CPID":"NA"},{"UDATE":"20080411 15:05:28.000"},{"UPID":"NA"}]}
data:hold:1100 {"md":[{"CDATE":"20040826 02:08:57.000"},{"CPID":"NA"},{"UDATE":"20040826 02:08:57.000"},{"UPID":"NA"}]}
data:hold:727 {"md":[{"CDATE":"20120522:08:57.000"},{"CPID":"NA"},{"UDATE":"20120522:08:57.000"},{"UPID":"NA"}]}
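Each hold cell's value is a small JSON document whose "md" list carries one key/value pair per element. A minimal Python sketch of unpacking one of these cells (the helper name `parse_hold` is our own illustration, not part of the OCLC code):

```python
import json

def parse_hold(value):
    """Flatten the JSON metadata stored in a data:hold:<symbol> cell
    into a single dict of create/update timestamps and process IDs."""
    md = json.loads(value)["md"]
    merged = {}
    for entry in md:  # each list element holds a single key/value pair
        merged.update(entry)
    return merged

# A cell value as shown above (quotes normalized)
cell = ('{"md":[{"CDATE":"20080410 15:38:45.000"},{"CPID":"NA"},'
        '{"UDATE":"20080411 15:05:28.000"},{"UPID":"NA"}]}')

hold = parse_hold(cell)
print(hold["CDATE"])  # → 20080410 15:38:45.000
```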
8. Advantages
• Everything in one I/O – We get the record, all of its metadata, and a
complete set of "who owns it and for how long", in one call to
HBase. HBase can generally read it in 1 physical I/O.
• New requirements – The existing Oracle table is a binary indicator
of "I own this". Adding new columns to that table was going to be
very difficult.
• With HBase, we're now storing complete ownership information
simply by creating new column qualifiers.
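As a sketch of that flexibility, a new holding is just another (qualifier, value) pair; the helper `make_hold_cell` and its timestamp values below are illustrative, not OCLC's actual code:

```python
import json

def make_hold_cell(library_symbol, cdate, udate, pid="NA"):
    """Build the column qualifier and JSON value for one library's
    holding. New attributes become new entries in the "md" list (or
    new qualifiers entirely); no schema change is required."""
    qualifier = "data:hold:%s" % library_symbol
    value = json.dumps({"md": [{"CDATE": cdate}, {"CPID": pid},
                               {"UDATE": udate}, {"UPID": pid}]})
    return qualifier, value

q, v = make_hold_cell(727, "20120522 02:08:57.000", "20120522 02:08:57.000")
print(q)  # → data:hold:727
```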
9. Problems
Nagle – We've disabled Nagle's algorithm across the board.
HBase Balancer – We've written a script that balances (outside of the
default balancer) at the table level. We're hoping that "Allow regions
to be load-balanced by table" (HBASE-3373) is included in 0.94.
IOPS – For us, HBase serves online, user-facing traffic. Our cluster
is designed so that we have plenty of capacity for this use, but it's
easy for MapReduce activity to fully utilize the available I/O and
leave HBase nothing to work with.
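Disabling Nagle for HBase RPC is a configuration change; a sketch of the relevant hbase-site.xml properties is below (verify the exact key names against your HBase version):

```xml
<!-- hbase-site.xml: disable Nagle's algorithm on both sides of HBase RPC -->
<property>
  <name>hbase.ipc.client.tcpnodelay</name>
  <value>true</value>
</property>
<property>
  <name>ipc.server.tcpnodelay</name>
  <value>true</value>
</property>
```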
10. Status – Hardware/Software Systems
Production Cluster
• 50 Nodes – 3 'Control' Nodes, 3 'edge' Nodes, 44 Data Nodes
• 8 CPU / 32 GB RAM / 8 TB disk per node
• 3-rack configuration – 10 Gb interconnects
6 Node Clusters – Used for testing and disaster recovery
• 2 development clusters – IntegrationTest, ProofOfConcept
• 2 clusters in a separate datacenter – Business Continuity, Pre-
production Testing
Versions
• Cloudera Distribution 3 Update 3 – CDH3U3
• Apache HBase 0.92.1
11. Backup/Restore
We've built our own backup/restore capability, like that described in:
https://issues.apache.org/jira/browse/HBASE-4618
It allows for both inter- and intra-datacenter backups and restores.
On github at:
https://github.com/oclc/HBase-Backup
The backup runs weekly and on demand.
12. Other Interesting Data Sets OCLC is moving to
HBase
• The Dewey Decimal Editorial System - The system where the editors
of the Dewey Decimal System do their work.
• VIAF - "Virtual International Authority File" - A joint project of
several national libraries plus selected regional and trans-national
library agencies. The project's goal is to lower the cost and increase
the utility of library authority files by matching and linking widely-
used authority files and making that information available on the
Web.