WorldCat is the world’s largest network of library content and services. Over 25,000 libraries in 170 countries have cooperated for 40 years to build WorldCat. OCLC is currently transitioning WorldCat from Oracle to Apache HBase. This session will discuss our data design for representing the constantly changing ownership information for thousands of libraries (billions of data points, millions of daily updates) and our plans for managing HBase in an environment that is equal parts end-user-facing and batch.
OCLC migrates WorldCat data to Apache HBase
1. Apache HBase at OCLC
Ron Buckley
May 22, 2012
ron_buckley@oclc.org
2. About OCLC
OCLC delivers single-search-box access to more than 943 million items
from your library and the world's library collections. You'll find:
1.8 billion ownership indications (library holdings information)
214+ million books in libraries worldwide
663+ million articles with one-click access to full text
28+ million digital items from trusted sources like Google Books, OAIster
and HathiTrust
13+ million eBooks from leading aggregators and publishers
44+ million pieces of evaluative content (Tables of Contents, cover art,
summaries, etc.) included at no additional charge
And a LOT more (interlibrary loan facilitation, API access, library-centric
research)
3. Main Case for OCLC
• Library gets a new book.
• Librarian needs to enter all the data about that item into their local
system.
• It takes quite some time to correctly enter cataloging data into a local
system.
• Thousands of libraries are all going to get the same book and do the
same things, thereby replicating each other's work.
• There should be a system whereby libraries can share and build on each
other's work.
• SaaS before buzzwords were cool. System proposed in July 1966. First
use in 1971.
*A member of the HBase implementation team also worked on the initial
OCLC system.
4. Current Data State at OCLC
• Oracle (WorldCat – Oracle RAC)
• SAN Storage (Approximately 20 TB)
• Several other smaller instances of Oracle
• A LOT of stored procedures for read and update. The most
commonly used are 10 years old and difficult to follow (being
polite)
• Two copies of the primary database in other formats, various
processes to keep them in sync (or not)
5. Schema Design – Oracle Version
4 Main Tables; the Primary Key (xwcmd_id) is an ever-increasing, OCLC-
assigned number for every library resource.
6. Schema Design – HBase Version
4 Tables become 1
Use Columns as data
7. Using column qualifiers to represent library
ownership
hbase(main):001:0> get 'Worldcat','1'
data:createDate value=19690526 00:00:00.000
data:hold:10810 value={"md":[{"CDATE":"20080410 15:38:45.000"},{"CPID":"NA"},{"UDATE":"20080411 15:05:28.000"},{"UPID":"NA"}]}
data:hold:1100 value={"md":[{"CDATE":"20040826 02:08:57.000"},{"CPID":"NA"},{"UDATE":"20040826 02:08:57.000"},{"UPID":"NA"}]}
Qualifier Value
data:hold:10810 {"md":[{"CDATE":"20080410 15:38:45.000"},{"CPID":"NA"},{"UDATE":"20080411 15:05:28.000"},{"UPID":"NA"}]}
data:hold:1100 {"md":[{"CDATE":"20040826 02:08:57.000"},{"CPID":"NA"},{"UDATE":"20040826 02:08:57.000"},{"UPID":"NA"}]}
data:hold:727 {"md":[{"CDATE":"20120522:08:57.000"},{"CPID":"NA"},{"UDATE":"20120522:08:57.000"},{"UPID":"NA"}]}
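Each hold cell's value is a small JSON document whose "md" list carries one key/value pair per element. A minimal Python sketch of unpacking one of these cells (the helper name `parse_hold` is our own illustration, not part of the OCLC code):

```python
import json

def parse_hold(value):
    """Flatten the JSON metadata stored in a data:hold:<symbol> cell
    into a single dict of create/update timestamps and process IDs."""
    md = json.loads(value)["md"]
    merged = {}
    for entry in md:  # each list element holds a single key/value pair
        merged.update(entry)
    return merged

# A cell value as shown above (quotes normalized)
cell = ('{"md":[{"CDATE":"20080410 15:38:45.000"},{"CPID":"NA"},'
        '{"UDATE":"20080411 15:05:28.000"},{"UPID":"NA"}]}')

hold = parse_hold(cell)
print(hold["CDATE"])  # → 20080410 15:38:45.000
```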
8. Advantages
• Everything in one I/O – We get the record, all of its metadata, and a
complete set of "who owns it and for how long", in one call to
HBase. HBase can generally read it in 1 physical I/O.
• New requirements – The existing Oracle table is a binary indicator
of "I own this". Adding new columns to that table was going to be
very difficult.
• With HBase, we're now storing complete ownership information
simply by creating new column qualifiers.
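As a sketch of that flexibility, a new holding is just another (qualifier, value) pair; the helper `make_hold_cell` and its timestamp values below are illustrative, not OCLC's actual code:

```python
import json

def make_hold_cell(library_symbol, cdate, udate, pid="NA"):
    """Build the column qualifier and JSON value for one library's
    holding. New attributes become new entries in the "md" list (or
    new qualifiers entirely); no schema change is required."""
    qualifier = "data:hold:%s" % library_symbol
    value = json.dumps({"md": [{"CDATE": cdate}, {"CPID": pid},
                               {"UDATE": udate}, {"UPID": pid}]})
    return qualifier, value

q, v = make_hold_cell(727, "20120522 02:08:57.000", "20120522 02:08:57.000")
print(q)  # → data:hold:727
```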
9. Problems
Nagle – We've disabled Nagle's algorithm across the board.
HBase Balancer – We've written a script that balances (outside of the
default balancer) at the table level. We're hoping that "Allow regions
to be load-balanced by table" (HBASE-3373) is included in 0.94.
IOPS – For us, HBase serves online, user-facing traffic. Our cluster
is designed so that we have plenty of capacity for this use, but it's
easy for MapReduce activity to fully utilize the available I/O and
leave HBase nothing to work with.
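Disabling Nagle for HBase RPC is a configuration change; a sketch of the relevant hbase-site.xml properties is below (verify the exact key names against your HBase version):

```xml
<!-- hbase-site.xml: disable Nagle's algorithm on both sides of HBase RPC -->
<property>
  <name>hbase.ipc.client.tcpnodelay</name>
  <value>true</value>
</property>
<property>
  <name>ipc.server.tcpnodelay</name>
  <value>true</value>
</property>
```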
10. Status – Hardware/Software Systems
Production Cluster
• 50 Nodes – 3 'Control' Nodes, 3 'edge' Nodes, 44 Data Nodes
• 8 CPU / 32 GB RAM / 8 TB disk per node
• 3-rack configuration – 10 Gb interconnects
6 Node Clusters – Used for testing and disaster recovery
• 2 development clusters – IntegrationTest, ProofOfConcept
• 2 clusters in a separate datacenter – Business Continuity, Pre-
production Testing
Versions
• Cloudera Distribution 3 Update 3 – CDH3U3
• Apache HBase 0.92.1
11. Backup/Restore
We've built our own backup/restore capability, like that described in:
https://issues.apache.org/jira/browse/HBASE-4618
It allows for both inter- and intra-datacenter backups and restores.
On github at:
https://github.com/oclc/HBase-Backup
The backup runs weekly and on demand.
12. Other Interesting Data Sets OCLC is moving to
HBase
• The Dewey Decimal Editorial System - The system where the editors
of the Dewey Decimal System do their work.
• VIAF - "Virtual International Authority File" - A joint project of
several national libraries plus selected regional and trans-national
library agencies. The project's goal is to lower the cost and increase
the utility of library authority files by matching and linking widely-
used authority files and making that information available on the
Web.