Ensuring Data Integrity white paper
Ensuring Data Integrity in a Digital Preservation Archive
Future Perfect Conference 2012: Digital Preservation by Design (March 2012)
Abstract

This paper discusses the challenges of, and some working solutions to, a key requirement of digital preservation: ongoing data integrity of the archive. The solutions were developed cooperatively by three vendors in conjunction with the Church of Jesus Christ of Latter-day Saints. State-of-the-art, in-drive data validation plays a key role in ensuring ongoing data integrity.

Introduction to the Church of Jesus Christ of Latter-day Saints

The Church of Jesus Christ of Latter-day Saints is a worldwide Christian church with more than 14.4 million members and 28,784 congregations. With headquarters in Salt Lake City, Utah (USA), the Church operates 136 temples, three universities, a business college, and thousands of seminaries and institutes of religion around the world that enroll more than 700,000 students in religious training.

The Church has a scriptural mandate to keep records of its proceedings and preserve them for future generations. Accordingly, the Church has been creating and keeping records since 1830, when it was organized. A Church Historian’s Office was formed in the 1840s, and in 1972 it was renamed the Church History Department.

Digital Preservation and the Church History Department

Today, the Church History Department has ultimate responsibility for preserving records of enduring value that originate from its ecclesiastical leaders and within the various Church departments, the Church’s educational institutions, and its affiliations.

To fulfill its responsibility, the Church History Department has implemented a Digital Records Preservation System (DRPS) that is based on Ex Libris Rosetta. Rosetta provides configurable preservation workflows and advanced preservation planning functions, but only writes a single copy of an Archival Information Package [1] (AIP) to a storage device for permanent storage. An appropriate storage layer must be integrated with Rosetta in order to provide the full capabilities of a digital preservation archive, including AIP replication.

After investigating a host of potential storage layer solutions, the Church History Department chose NetApp StorageGRID to provide the Information Lifecycle Management (ILM) capabilities that were desired. In particular, StorageGRID’s data integrity, data resilience, and data replication capabilities were attractive.

In order to support ILM migration of AIPs from disk to tape, StorageGRID utilizes IBM Tivoli Storage Manager (TSM) as an interface to tape libraries.
DRPS also employs software extensions developed by Church Information and Communications Services (shown in the red boxes below and described later).

[Figure: Architecture of the Church History Department’s Digital Records Preservation System (DRPS)]

Within a decade, the Church anticipates that it will have generated a cumulative archival capacity of more than 100 petabytes for a single copy of AIPs. Therefore, the total cost of ownership of DRPS archival storage must be minimized.

An internal study showed that the total cost of automated tape cartridges would be a third of the corresponding cost of disk arrays. Therefore, the Church History Department currently uses automated tape libraries for DRPS archival storage.

Internal departments of the Church generate multiple petabytes of records annually. Record format types range from documents and images to videos of the birth, life, death, and resurrection of the Lord Jesus Christ that were given to the world as a free gift last Christmas by the Church (available at biblevideos.lds.org).

[Image: Nativity scene from biblevideos.lds.org]

Each department has a detailed records management plan that specifies which of its collections are appraised as having enduring value. Typically, less than a tenth of the collections are targeted for preservation. In the future, selected Church websites will also be preserved. In addition, a multi-petabyte backlog of audiovisual collections is presently being ingested into DRPS.

Data Corruption in a Digital Preservation Archive

A critical requirement of a digital preservation system is the ability to continuously ensure data integrity of its archive. This requirement differentiates a tape archive from other tape farms.

Modern IT equipment, including servers, storage, network switches, and routers, incorporates advanced features to minimize data corruption. Nevertheless, undetected errors still occur for a variety of reasons.

Whenever data files are written, read, stored, transmitted over a network, or processed, there is a small but real possibility that corruption will occur. Causes range from hardware and software failures to network transmission failures and interruptions. Bit flips (also called bit rot) within data stored on tape also cause data corruption.
Recently, data integrity of the entire DRPS tape archive was validated. This validation run encountered a 3.3 × 10^-14 bit error rate.

The USC Shoah Foundation Institute for Visual History and Education has observed a 2.3 × 10^-14 bit error rate within its tape archive, which required the preservation team to flip back 1500 bits per 8 petabytes of archive capacity. [2]

These real-life measurements, one taken from a large archive and the other from a relatively small archive, provide a credible estimate of the amount of data corruption that will occur in a digital preservation tape archive. Therefore, working solutions must be implemented to detect and correct these bit errors.

DRPS Solutions to Data Corruption

In order to continuously ensure data integrity of its tape archive, DRPS employs fixity information.

Fixity information is a checksum (i.e., integrity value) calculated by a secure hash algorithm to ensure data integrity of an AIP file throughout preservation workflows and after the file has been written to the archive.

By comparing fixity values before and after records are written, transferred across a network, moved, or copied, DRPS can determine if data corruption has taken place during the workflow or while the AIP is stored in the archive. DRPS uses a variety of hash values, cyclic redundancy check values, and error-correcting codes for such fixity information.

In order to implement fixity information as early as possible in the preservation process, and thus minimize data errors, DRPS provides ingest tools developed by Church Information and Communications Services (ICS) that create SHA-1 fixity information for producer files before they are transferred to DRPS for ingest (see the DRPS architecture shown previously).

Within Rosetta, SHA-1 fixity checks are performed three times: (i) when the deposit server receives a Submission Information Package [1] (SIP), (ii) during the SIP validation process, and (iii) when a file is moved to permanent storage.

Rosetta also provides the capability to perform fixity checks on files after they have been written to permanent storage, but the ILM features of StorageGRID do not utilize this capability. Therefore, StorageGRID must take over control of the fixity information once files have been ingested into the grid.

By collaborating on this process, ICS and Ex Libris have succeeded in handing off the fixity information from Rosetta to StorageGRID. This is accomplished with a web service developed by ICS that retrieves SHA-1 hash values generated independently by StorageGRID when the files are written to the StorageGRID gateway node. Ex Libris developed a Rosetta plug-in that calls this web service and compares the StorageGRID SHA-1 hash values with those in the Rosetta database, which are known to be correct.

Turning now to the storage layer of DRPS, StorageGRID is constructed around the concept of object storage. To ensure object data integrity, StorageGRID provides a layered and overlapping set of protection domains that guard against data corruption and alteration of files that are written to the grid.
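The two bit error rates reported earlier in this section can be cross-checked with simple arithmetic. The sketch below is illustrative only: it assumes decimal petabytes (10^15 bytes), and the function name `expected_bit_errors` is hypothetical. Applying the measured DRPS rate to the anticipated 100-petabyte archive is an extrapolation made here for illustration, not a figure from the paper.

```python
PETABYTE = 10**15  # assumption: decimal petabyte (10^15 bytes)


def expected_bit_errors(capacity_bytes: float, bit_error_rate: float) -> float:
    """Expected number of flipped bits for an archive of the given size."""
    return capacity_bytes * 8 * bit_error_rate


# Rate observed by the USC Shoah Foundation Institute, applied to 8 PB.
shoah_errors = expected_bit_errors(8 * PETABYTE, 2.3e-14)

# Rate measured on the DRPS archive, extrapolated to 100 PB (illustrative).
drps_errors = expected_bit_errors(100 * PETABYTE, 3.3e-14)

print(f"8 PB at 2.3e-14:   about {shoah_errors:,.0f} bit errors")
print(f"100 PB at 3.3e-14: about {drps_errors:,.0f} bit errors")
```

Under these assumptions the first figure comes to roughly 1,470 bits, in line with the approximately 1500 bits per 8 petabytes quoted above, and the second to roughly 26,400 bits for a single 100-petabyte copy.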
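The before-and-after comparison of SHA-1 fixity values described above can be sketched in a few lines. This is a minimal illustration using Python's standard `hashlib`, not the ICS ingest tools or the Rosetta plug-in. The function names `sha1_fixity` and `fixity_matches` and the sample data are hypothetical.

```python
import hashlib
import io


def sha1_fixity(stream, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-1 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha1()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def fixity_matches(recorded_hex: str, stream) -> bool:
    """Compare a previously recorded fixity value with a fresh one."""
    return sha1_fixity(stream) == recorded_hex


# Hypothetical producer file; its fixity value is recorded before transfer.
original = b"example AIP content " * 1000
recorded = sha1_fixity(io.BytesIO(original))

# Simulate a single bit flip introduced during transfer or storage.
corrupted = bytearray(original)
corrupted[42] ^= 0x01

print(fixity_matches(recorded, io.BytesIO(original)))          # intact copy
print(fixity_matches(recorded, io.BytesIO(bytes(corrupted))))  # flipped bit
```

Reading in chunks keeps memory use flat for large audiovisual files. The check only detects corruption; repair still requires another good copy.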
The highest level domain utilizes the SHA-1 fixity information discussed above.

A SHA-1 hash value is generated for each AIP (or object) that Rosetta writes to permanent storage (i.e., to StorageGRID). Also called the Object Hash, the SHA-1 hash value is self-contained and requires no external information for verification.

Each object contains a SHA-1 object hash of the StorageGRID formatted data that comprise the object. The object hash is generated when the object is created (i.e., when the gateway node writes it to the first storage node).

To assure data integrity, the object hash is verified every time the object is stored and accessed. Furthermore, a background verification process uses the SHA-1 object hash to verify that the object, while stored on disk, has neither become corrupt nor been altered by tampering.

Underneath the SHA-1 object hash domain, StorageGRID also generates a Content Hash when the object is created. Since objects consist of AIP content data plus StorageGRID metadata, the content hash provides additional protection for AIP content files.

Because the content hash is not self-contained, it requires external information for verification, and therefore is checked only when the object is accessed.

Each StorageGRID object has a third and fourth domain of data protection applied, and two different types of protection are utilized.

First, a cyclic redundancy check (CRC) checksum is added that can be quickly computed to verify that the object has not been corrupted or accidentally altered. This CRC provides a verification process that minimizes resource use, but is not secure against deliberate alteration.

Second, a key-based hash value is appended. This value can be verified using the key that is stored as part of the metadata managed by StorageGRID. Although this hash value takes more resources to implement than the CRC checksum described above, it is secure against all forms of tampering as long as the key is protected.

The CRC checksum is verified during every StorageGRID object operation (i.e., store, retrieve, transmit, receive, access, and background verification). But, as with the object hash, the key-based hash value is only verified when the object is accessed.

Once a file has been correctly written to a StorageGRID storage node (i.e., its data integrity has been ensured through both SHA-1 object hash and CRC fixity checks), StorageGRID invokes the TSM Client running on the archive node server in order to write the file to tape.

As this happens, the SHA-1 (object hash) fixity information is not handed off to TSM. Rather, it is superseded with new fixity information composed of various cyclic redundancy check values and error-correcting codes that provide TSM end-to-end logical block protection when writing the file to tape.

Thus the DRPS fixity information chain of control is altered when StorageGRID invokes TSM. Nevertheless, validation of the file’s data integrity continues seamlessly until it is written to tape.
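The difference between the CRC checksum and the key-based hash value described above can be illustrated with a short sketch. The paper does not say which CRC or keyed-hash algorithms StorageGRID uses, so CRC-32 (`zlib.crc32`) and HMAC-SHA-256 are assumptions here, and the helper names `protect`, `crc_ok`, and `mac_ok` are hypothetical.

```python
import hashlib
import hmac
import zlib


def protect(data: bytes, key: bytes) -> dict:
    """Attach a cheap CRC and a keyed hash to an object's data."""
    return {
        "data": data,
        "crc": zlib.crc32(data),                             # fast, unkeyed
        "mac": hmac.new(key, data, hashlib.sha256).digest(),  # needs the key
    }


def crc_ok(obj: dict) -> bool:
    """Detects accidental corruption; anyone can recompute it."""
    return zlib.crc32(obj["data"]) == obj["crc"]


def mac_ok(obj: dict, key: bytes) -> bool:
    """Detects corruption and tampering, as long as the key stays secret."""
    expected = hmac.new(key, obj["data"], hashlib.sha256).digest()
    return hmac.compare_digest(expected, obj["mac"])


key = b"hypothetical-grid-managed-key"
stored = protect(b"AIP content plus metadata", key)

# Accidental corruption: a single flipped bit fails both checks.
flipped = dict(stored, data=bytes([stored["data"][0] ^ 0x01]) + stored["data"][1:])

# Deliberate alteration: the attacker recomputes the CRC but lacks the key.
tampered = dict(stored, data=b"altered content")
tampered["crc"] = zlib.crc32(tampered["data"])

print(crc_ok(stored), mac_ok(stored, key))      # both pass
print(crc_ok(flipped), mac_ok(flipped, key))    # both fail
print(crc_ok(tampered), mac_ok(tampered, key))  # CRC passes, keyed hash fails
```

The last case is why the two protections are layered: the CRC is cheap enough to check on every operation, while the keyed hash covers deliberate alteration.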
The process begins when the TSM server calculates and appends a CRC value to each logical block of the file before transferring it to a tape drive for writing. Each appended CRC is called the “original data CRC” for that logical block.

When the tape drive receives a logical block, it computes its own CRC for the data and compares it to the original data CRC. If an error is detected, a check condition is generated, forcing a re-drive or a permanent error, which effectively guarantees protection of the logical block during transfer.

In addition, as the logical block is loaded into the drive’s main data buffer, two other processes occur:

(1) Data received at the buffer is cycled back through an on-the-fly verifier that once again validates the original data CRC. Any introduced error will again force a re-drive or a permanent error.

(2) In parallel, an error-correcting code (ECC) is computed and appended to the data. Referred to as the “C1 code,” this ECC protects data integrity of the logical block as it goes through additional formatting steps, including the addition of an additional ECC, referred to as the “C2 code.”

As part of these formatting steps, the C1 code is checked every time data is read from the data buffer. Thus, protection of the original data CRC is essentially transformed to protection from the more powerful C1 code.

Finally, the data is read from the main buffer and is written to tape using a read-while-write process. During this process, the just written data is read back from tape and loaded into the main data buffer so the C1 code can be checked once again to verify the written data.

A successful read-while-write operation assures that no data corruption has occurred from the time the file’s logical block was transferred from the TSM server until it is written to tape. And using these ECCs and CRCs, the tape drive can validate logical blocks at full line speed as they are being written!

During a read operation (i.e., when Rosetta accesses an AIP), data is read from the tape and all three codes (C1, C2, and the original data CRC) are decoded and checked. A read error is generated if any process indicates an error.

The original data CRC is then appended to the logical block when it is transferred to the TSM server so it can be independently verified by that server, thus completing the TSM end-to-end logical block protection cycle.

This advanced and highly efficient TSM end-to-end logical block protection is enabled with state-of-the-art functions available with IBM LTO-5 and TS1140 tape drives.

When the TSM server sends the data over the network to a TSM client, CRC checking is done once again to ensure integrity of the data as it is written to the StorageGRID storage node.

From there, StorageGRID fixity checking occurs, as explained previously for object access (including content hash and key-based hash value checking) until the data is transferred to Rosetta for delivery to its requestor, thus completing the DRPS data integrity validation cycle.
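The first step of the logical block protection described above, where the sender appends a CRC to each block and the receiver recomputes and compares it, can be sketched as follows. This is a simplified model, not the TSM or tape drive implementation: CRC-32 from `zlib` stands in for the drive's actual CRC, the 4-byte framing and the function names `frame_blocks` and `receive_block` are hypothetical, and the C1 and C2 error-correcting codes are not modeled.

```python
import struct
import zlib


def frame_blocks(data: bytes, block_size: int):
    """Sender side: split data into logical blocks and append a CRC to each."""
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        yield block + struct.pack(">I", zlib.crc32(block))


def receive_block(framed: bytes) -> bytes:
    """Receiver side: recompute the CRC and compare it with the appended one."""
    block = framed[:-4]
    (original_crc,) = struct.unpack(">I", framed[-4:])
    if zlib.crc32(block) != original_crc:
        # Stands in for the check condition that forces a re-drive.
        raise ValueError("CRC mismatch: logical block must be re-sent")
    return block


payload = b"logical block protection demo " * 10
framed_blocks = list(frame_blocks(payload, block_size=64))

# A clean transfer reassembles the original payload.
received = b"".join(receive_block(fb) for fb in framed_blocks)
print(received == payload)

# A bit flipped in transit is caught before the block is accepted.
damaged = bytearray(framed_blocks[0])
damaged[3] ^= 0x80
try:
    receive_block(bytes(damaged))
except ValueError as err:
    print("rejected:", err)
```

Because the CRC travels with each block, every hop along the path can repeat the same check without consulting an external database.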
[Figure: DRPS Data Integrity Validation]
- DRPS Ingest Tools (SHA-1 control): SHA-1 created for producer files.
- Rosetta (SHA-1 control): SHA-1 checked upon ingest and write to permanent storage.
- Storage Extensions: web service retrieves StorageGRID SHA-1, then Rosetta plug-in compares with Rosetta SHA-1.
- StorageGRID (SHA-1 control): SHA-1 created for ingested files; SHA-1 and other fixity checked during write to storage nodes.
- Tivoli Storage Manager (CRCs, ECCs): IBM TSM end-to-end logical block protection.
Summary of the DRPS data integrity validation cycle

Ensuring Ongoing Data Integrity

Unfortunately, continuously ensuring data integrity of a DRPS AIP does not end once the AIP has been written correctly to tape. Periodically, the tape(s) containing the AIP need to be checked to uncover errors (i.e., bit flips) that may have occurred since the AIP was correctly written.

Fortunately, IBM LTO-5 and TS1140 tape drives can perform this check without having to stage the AIP to disk, which is clearly a resource-intensive task, especially for an archive with a capacity measured in hundreds of petabytes!

IBM LTO-5 and TS1140 drives can perform data integrity validation in-drive, which means a drive can read a tape and concurrently check the AIP logical block CRC and ECCs discussed above (C1, C2, and the original data CRC). Good or bad status is reported as soon as these internal checks are completed. And this is done without requiring any other resources!

Clearly, this advanced capability enhances the ability of DRPS to perform periodic data integrity validations of the entire archive more frequently, which will facilitate the correction of bit flips that occur after AIPs are written correctly to tape.

Conclusion

The Church of Jesus Christ of Latter-day Saints is making a substantial investment to continuously ensure data integrity of its DRPS archive as described in this paper.

The benefits of preserving the Church’s exalting and inspiring records cannot be measured in financial terms, however. Those benefits include building character and strengthening families, both of which are designed to foster both personal and family happiness.

References

[1] CCSDS 650.0-B-1 Blue Book, “Reference Model for an Open Archival Information System (OAIS),” Consultative Committee for Space Data Systems (2002).
[2] Private conversation with Sam Gustman (CTO) at the USC Shoah Foundation Institute, August 19, 2009.