1. HATHITRUST
A Shared Digital Repository
HathiTrust: Aspiring to Build
the Universal Library
UKSG Annual Conference
March 26-28, 2012
Jeremy York, Project Librarian, HathiTrust
2. Partnership
Arizona State University
Baylor University
Boston College
Boston University
California Digital Library
Columbia University
Cornell University
Dartmouth College
Duke University
Emory University
Florida State University
Getty Research Institute
Harvard University Library
Indiana University
Johns Hopkins University
Lafayette College
Library of Congress
Massachusetts Institute of Technology
McGill University
Michigan State University
New York Public Library
New York University
North Carolina Central University
North Carolina State University
Northwestern University
The Ohio State University
The Pennsylvania State University
Princeton University
Purdue University
Stanford University
Texas A&M University
Universidad Complutense de Madrid
University of Arizona
University of Calgary
University of California (Berkeley, Davis, Irvine, Los Angeles, Merced, Riverside, San Diego, San Francisco, Santa Barbara, Santa Cruz)
The University of Chicago
University of Connecticut
University of Florida
University of Illinois
University of Illinois at Chicago
The University of Iowa
University of Maryland
University of Miami
University of Michigan
University of Minnesota
University of Missouri
University of Nebraska-Lincoln
The University of North Carolina at Chapel Hill
University of Notre Dame
University of Pennsylvania
University of Pittsburgh
University of Utah
University of Virginia
University of Washington
University of Wisconsin-Madison
Utah State University
Washington University
Yale University Library
3. Digital Repository
• Launched 2008
• Initial focus on digitized book and journal
content
– 10,109,919 total volumes
– 5,372,755 book titles
– 266,540 serial titles
– 2,802,347 public domain (~28%)
4. The Name
• The meaning behind the name
– Hathi (hah-tee)--Hindi for elephant
– Big, strong
– Never forgets, wise
– Secure
– Trustworthy
5. Mission
• To contribute to the common good by collecting,
organizing, preserving, communicating, and
sharing the record of human knowledge
6. HathiTrust
Universal Library
Common Goal
Single Entity, Many Partners
7. Collections and Collaboration
• Comprehensive collection
– Preservation…with Access
• Shared strategies
– Copyright
– Collection management, development
– Preservation
– Discovery / Use
– Bibliographic Indeterminacy
– Efficient user services
• Public Good
8. Content Distribution
[Pie chart]
• Not in the public domain: 72%
• "Public Domain": 28%
– Public Domain (worldwide): 14%
– Public Domain (US): 10%
– U.S. Federal Government Documents (worldwide): 4%
• Open Access: .1%
• Creative Commons: .01%
9. Content Sources
[Pie chart: share of content by contributing institution]
• Michigan: 45%
• California: 33%
• All other contributors: about 5% or less each (Wisconsin, Cornell, NYPL, Princeton, Purdue, Penn State, Illinois, Columbia, Northwestern, Duke, NCSU, Chicago, Indiana, Virginia, Utah State, Harvard, Madrid, Yale, UNC-Chapel Hill, LC, Minnesota)
12. Language Distribution (1)
• The top 10 languages make up ~86% of all content
– English: 48%
– German: 9%
– French: 7%
– Spanish: 5%
– Chinese: 4%
– Russian: 4%
– Japanese: 3%
– Italian: 3%
– Arabic: 2%
– Latin: 1%
• Remaining Languages: 14%
13. Language Distribution (2)
• The next 40 languages make up ~13% of total
[Pie chart; per-language percentages are not recoverable from the extracted text]
• Categories shown: Ancient Greek, Armenian, Bengali, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, Finnish, Greek, Hebrew, Hindi, Hungarian, Indonesian, Korean, Malay, Malayalam, Marathi, Multiple, Music, Norwegian, Panjabi, Persian, Polish, Portuguese, Romanian, Sanskrit, Serbian, Slovak, Swedish, Tamil, Telugu, Thai, Turkish, Ukrainian, Undetermined, Unknown, Urdu, Vietnamese
14. Preservation with Access
• Cost effective preservation and access services
• Preservation
– TRAC-certified
– Robust infrastructure
– Long-term commitments on digital content
facilitate planning, decision-making
15. Executive Committee
• Budget/Finances, Decision-making
Strategic Advisory Board
• Guidance on Policy, Planning
Collective Work: Working Groups and Committees
• Operational
– Communications
– User Support
– User Experience
• Strategic
– Collections
– Discovery Interface
– Full-text Search
Distributed work
• Driven by needs of institutions
• Leverage across the partnership
• Projects, Grant Work, Ingest Specifications, PageTurner, Bibliographic Data Management
16. HathiTrust Functional Framework
Enterprise Management
• Budget, Finances
• Decision-making
• Project management
Governance
• Communication and Coordination with partner institutions
• Policy
• Planning
Repository Administration
• Data management (content storage, backup, integrity checks, deletion)
• Web and application server configuration and maintenance
• Hardware configuration and maintenance
• Hardware selection and replacement
• Security
• Content and Metadata specifications
• Permissions
• Disaster Recovery
• Logging
• Processes for ensuring content integrity
Rights Management
• Copyright determination
• Copyright review
• Copyright information management (database)
• Rightsholder permissions
Bibliographic Data Management
• Entity description (record-level)
• Object identification (item-level)
• Data availability
Collection Development
• Digital
– Expansion beyond books and journals (born-digital, images and maps, audio)
– Selection of content (for non-Google volume ingest and pilot projects)
• Print
– Cloud Library (effect of digital on print)
e-Commerce
• Print on Demand
• Financial contributions of partners
Content Ingest
• Transformation
• Validation
Content Access
• PageTurner
• Collection Builder
• Large-scale Search
• Research Center
• Bibliographic Catalog
• APIs
User Services
• Usability
• User support (helpdesk)
Quality Assurance
• Quality Review
• Content Certification
• Repository evaluation and audit (e.g., DRAMBORA, TRAC)
Outreach
• Project website
• Monthly newsletter
• Papers and presentations
• Communication with potential partners
• Surveys, general inquiries
Legal
• Risk management (use of materials)
• Partner agreements
• Advocacy
17. Constitutional Convention
• October 2011
• 52 partners
• 3-year review overseen by SAB
• Ballot Proposals
– Print monograph storage
– Approval Process for development initiatives
– U.S. Government Documents
– Fee-for-service content deposit
– Governance
18. Emerging Governance
• 12-member Board of Governors
– 3-member Executive Committee
– Executive Director
• 6 seats to founding institutions
– 2 California, 2 CIC (minus Indiana and Michigan)
– 1 Indiana, 1 Michigan
• Voting (March 1 – March 15)
• Announcement of Results March 30
• Begin work April 16, 2012
19. Preservation with Access
• Cost effective preservation and access services
• Preservation
– TRAC-certified
– Robust infrastructure
– Long-term commitments on digital content
facilitate planning, decision-making
20. Preservation with Access (2)
• Discovery
– Bibliographic and full-text search of all materials
– Extended discovery (ProQuest, EBSCO, OCLC, Ex Libris)
– Mechanisms for local loading of records
24. Preservation with Access (3)
• Access and Use
– Public domain and open access works
– Full download of materials where possible*
– Print on demand
– Collections and APIs
– Research Center*
– Lawful uses of in-copyright works*
25. Lawful uses
• Access to users who have print disabilities
• Section 108 uses of materials
• Access to orphan works
26. Terms of Access
• Available to students, faculty, staff of
partnering institutions
– On library premises or authenticated into
HathiTrust
• Partner libraries own a print copy
– One simultaneous user per print copy owned
• Users must be on U.S. soil
• One page at a time download
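The terms above read as a conjunction of checks, which can be sketched in a few lines of Python. This is an illustration only, not HathiTrust's access system: the `AccessRequest` fields and the `may_access` function are hypothetical names, and the one-page-at-a-time download limit is left out.

```python
from dataclasses import dataclass


@dataclass
class AccessRequest:
    """Hypothetical description of one access attempt (all field names are assumptions)."""
    is_partner_affiliate: bool  # student, faculty, or staff of a partnering institution
    on_library_premises: bool
    authenticated: bool         # authenticated into HathiTrust
    in_united_states: bool      # user is on U.S. soil
    print_copies_owned: int     # print copies held by the user's partner library
    active_users: int           # users already viewing the volume through that library


def may_access(req: AccessRequest) -> bool:
    """Apply the listed terms in order; every one must hold."""
    if not req.is_partner_affiliate:
        return False
    if not (req.on_library_premises or req.authenticated):
        return False
    if not req.in_united_states:
        return False
    if req.print_copies_owned < 1:
        return False
    # One simultaneous user per print copy owned.
    return req.active_users < req.print_copies_owned


# Two print copies owned, one reader already active: a second reader is allowed.
print(may_access(AccessRequest(True, False, True, True, 2, 1)))  # True
```

Treating the simultaneous-user limit as a comparison against the print-copy count keeps the rule tied to holdings data, which is one reason a print holdings database matters.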
27. How do we facilitate uses?
• Fundamental issues of
– Identification
– Description
– Rights
28. Approach
• Collective problems as collective
• Web of relationships
[Diagram nodes: Rights, Records, Digital Volumes, Libraries, Print Volumes]
31. Automatic Rights Determination
• Conducted on all works at time of ingest and
when records are modified
– Public domain worldwide
• US works published before 1923, US federal
government publications, non-US works published prior
to 1872
– Public domain in the United States
• Non-US works published prior to 1923
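The date and origin rules on this slide can be written as a small decision function. The sketch below is a minimal illustration under stated assumptions: the function name, its parameters, and the three result labels are hypothetical and are not HathiTrust's actual rights codes or ingest code. It treats "before" and "prior to" as strict comparisons on the publication year.

```python
from typing import Optional

# Hypothetical labels for illustration only.
PD_WORLDWIDE = "public domain worldwide"
PD_US = "public domain in the United States"
NOT_DETERMINED = "not determined automatically"


def automatic_rights(pub_year: Optional[int], us_work: bool,
                     us_federal_document: bool = False) -> str:
    """Classify a work using only the rules stated on the slide."""
    if us_federal_document:
        return PD_WORLDWIDE          # US federal government publications
    if pub_year is None:
        return NOT_DETERMINED        # no usable date in the record
    if us_work:
        # US works published before 1923
        return PD_WORLDWIDE if pub_year < 1923 else NOT_DETERMINED
    if pub_year < 1872:
        return PD_WORLDWIDE          # non-US works published prior to 1872
    if pub_year < 1923:
        return PD_US                 # non-US works published prior to 1923
    return NOT_DETERMINED


print(automatic_rights(1900, us_work=True))   # public domain worldwide
print(automatic_rights(1900, us_work=False))  # public domain in the United States
```

Because the outcome depends on the record's date and place of publication, the check has to be rerun when records are modified.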
32. Manual Rights Determination
• IMLS-funded CRMS project
– US-published works 1923-1963
– Conformance with formalities
– Expanding to non-US works
– Double-blind review with expert review for conflicts
– Staff at 4 HathiTrust partner institutions (15 will take
part in non-US)
– As of February 2012 ~190,000 reviewed, more than
100,000 opened
• Rights Holder Permissions
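The double-blind review with expert review for conflicts, listed above, follows a simple pattern: two reviewers decide independently, and a third opinion is sought only when they disagree. A minimal sketch, assuming determinations are plain strings and the expert is any callable; `resolve_reviews` and the labels are hypothetical and not taken from the CRMS software.

```python
from typing import Callable


def resolve_reviews(first: str, second: str,
                    expert_review: Callable[[str, str], str]) -> str:
    """Return the agreed determination, or the expert's ruling on a conflict."""
    if first == second:
        return first                     # independent reviewers agree
    return expert_review(first, second)  # conflict goes to an expert


def example_expert(first: str, second: str) -> str:
    # Stand-in for a human expert; this hypothetical expert always rules "in copyright".
    return "in copyright"


print(resolve_reviews("public domain", "public domain", example_expert))  # public domain
print(resolve_reviews("public domain", "in copyright", example_expert))   # in copyright
```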
33. Breakdown of HathiTrust book corpus by publication date
Bibliographic Indeterminacy and the Scale of Problems and Opportunities of "Rights" in Digital Collection Building – 2/2011
42. A global change in the library environment
• Academic print book collection already substantially duplicated in mass digitized book corpus
[Chart: % of Titles in Local Collection (0% to 60%) against Rank in 2008 ARL Investment Index (0 to 120)]
– June 2009: median duplication 19%
– June 2010: median duplication 31%
43. Digitized Books in Shared Repositories
• ~75% of mass digitized corpus is ‘backed up’ in one or more shared print repositories
[Chart: Unique Titles (0 to 3,500,000) by month, Sep-09 to Jun-10]
– Mass digitized books in Hathi digital repository: ~3.5M titles
– Mass digitized books in shared print repositories: ~2.5M
44. Collection Management, Development
• Overlap
– More than 50% median overlap with ARL
institutions; higher for small liberal arts colleges
• Pricing model based on Print holdings
– Requires print holdings database
– Also supports expansion of legal uses and efforts in de-duplication
– Facilitates individual and collaborative collection
development and management operations
• Print monographs archiving
65. Comprehensive Picture
• “Definitional Issues”
– Identification, Description, Rights
• Discovery and Use
– Finding
– Relating (APIs and integration)
– Using (Reading, Computational activities)
• Collection management, development
• Preservation infrastructure
– Digital and Print
– Relationships
66. Work going forward
• Definitional elements
• Print archiving, management
• Discovery and use
– Lawful uses
• Research Center
• Quality
• Government documents
• Beyond books and journals
• Publishing
• Transitioning to next phase of partnership
67. How to find out more
• Web site “About” section
• http://www.hathitrust.org/about
• HathiTrust Research Center
• http://www.hathitrust.org/htrc
• Twitter
• http://twitter.com/hathitrust
• Monthly newsletter
• http://www.hathitrust.org/updates
• RSS: http://www.hathitrust.org/updates_rss
• Contact us: feedback@issues.hathitrust.org
• Blogs: http://www.hathitrust.org/blogs
• Large-scale search
• Perspectives from HathiTrust