Elephant in the Room:
Scaling Storage for the
HathiTrust Research Center
Robert H. McDonald | @mcdonald
Associate Dean for Library Technologies
Deputy Director Data to Insight Center (D2I)
Indiana University
PASIG 2015
#PASIG2015
UC San Diego
March 12, 2015
@hathitrust
Mission of the HT Research Center
• Research arm of HathiTrust
• Established: July, 2011
• Collaborative center: Indiana University & University
of Illinois
• Mission: Enable researchers world-wide to accomplish
tera-scale text data-mining and analysis
– Develop cyberinfrastructure to enable HPC access to the
HathiTrust Digital Library
– Develop cutting-edge software tools for processing,
analyzing text
– Develop translational tools and data that can be used to
enhance HathiTrust Digital Library services to users
HathiTrust and HTRC
[Organization diagram: HathiTrust; University of Michigan; University of Illinois; Indiana University; HathiTrust Research Center]
• Board of Governors
• Executive Committee
• Executive Director
Working with HTRC Staff
• Scholarly Commons: Workshops, tutorials, and guidance for using HTRC
• Advanced Collaborative Support: One-on-one research support provided through a competitive awards process
• Advanced Research: Collaborative research partnership with HTRC
HathiTrust “Wow” Numbers
• 13,284,163 total volumes
• 6,742,394 book titles
• 352,534 serial titles
• 4,649,457,050 pages
• 595 terabytes
• 157 miles
• 10,793 tons
• 4,979,599 volumes in the public domain
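The headline figures above can be turned into per-volume averages. The following is a minimal sketch that uses only the numbers on this slide; the variable names are illustrative, and it assumes decimal units (1 TB = 10^12 bytes), which the slide does not specify.

```python
# Figures copied from the "Wow" Numbers slide.
total_volumes = 13_284_163
total_pages = 4_649_457_050
public_domain_volumes = 4_979_599

# Assumption: decimal units, 1 TB = 10**12 bytes.
total_bytes = 595 * 10**12

pages_per_volume = total_pages / total_volumes
mb_per_volume = total_bytes / total_volumes / 10**6
public_domain_share = public_domain_volumes / total_volumes

print(f"pages per volume:    {pages_per_volume:.1f}")
print(f"MB per volume:       {mb_per_volume:.1f}")
print(f"public domain share: {public_domain_share:.1%}")
```

Under these assumptions the collection averages 350 pages and roughly 45 MB per volume, with about 37.5% of volumes in the public domain.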
Non-Consumptive Research Paradigm
• No action or set of actions on part of users,
either acting alone or in cooperation with
other users over duration of one or multiple
sessions can result in sufficient information
gathered from collection of copyrighted works
to reassemble pages from collection.
• Definition disallows collusion between users,
or accumulation of material over time.
Differentiates human researcher from proxy
which is not a user. Users are human beings.
DATA LEVELS FOR HTRC
• Level 0: Derived Factual Data or Supplementary Bibliographic Data; publicly available in US and worldwide (bulk download permitted)
• Level 1A: Page text or page images that are in the public domain in the US only and not subject to third-party restrictions; publicly available to everyone in the US (no bulk download)
• Level 1B: Page text or page images that are in the public domain worldwide and not subject to third-party restrictions; publicly available to everyone worldwide (no bulk download)
• Level 1C: Primary Bibliographic Metadata; publicly available in US and worldwide (no bulk download)
• Level 2A: Page text or page images that are in the public domain in the US only and are subject to third-party restrictions; publicly available to everyone in the US (no download)
• Level 2B: Page text or page images that are in the public domain worldwide and are subject to third-party restrictions; publicly available to everyone worldwide (no download)
• Level 3: Restricted in-copyright data; may or may not be subject to additional third-party restrictions (no download)
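One way to make the data levels machine-checkable is a small lookup table. This is only a sketch of how the levels listed here could be encoded: the `DataLevel` class, the `LEVELS` mapping, the `"bulk"` / `"no-bulk"` / `"none"` labels, and `allows_bulk_download` are hypothetical names for illustration, not part of any HTRC software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataLevel:
    audience: str  # who may access the data, per the slide
    download: str  # "bulk", "no-bulk", or "none" (labels are illustrative)


# Transcribed from the data levels listed on this slide.
LEVELS = {
    "0": DataLevel("US and worldwide", "bulk"),
    "1A": DataLevel("US only", "no-bulk"),
    "1B": DataLevel("worldwide", "no-bulk"),
    "1C": DataLevel("US and worldwide", "no-bulk"),
    "2A": DataLevel("US only", "none"),
    "2B": DataLevel("worldwide", "none"),
    "3": DataLevel("restricted", "none"),
}


def allows_bulk_download(level: str) -> bool:
    """Return True only for levels the slide marks as bulk-downloadable."""
    return LEVELS[level].download == "bulk"


if __name__ == "__main__":
    for name, info in LEVELS.items():
        print(f"Level {name}: {info.audience}, download={info.download}")
```

Only Level 0 permits bulk download; Levels 2A, 2B, and 3 permit no download at all.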
Working with HTRC Tools
Get started at: https://htrc2.pti.indiana.edu/
Build Worksets
Execute Algorithms
Visualize Term Frequency
http://sandbox.htrc.illinois.edu/bookworm/
HTRC Architecture
[Architecture diagram; components shown:]
• Portal Access
• Data API access interface
• Direct programmatic access (by programs running on HTRC machines)
• Security (OAuth2)
• Audit
• Cassandra cluster volume store
• Solr index
• Solr Proxy
• Blacklight
• Algorithms
• Result Sets
• Meandre Workflows
• Registry (WSO2)
• Agent
• Job Submission
• Collection building
• Collections
• Compute resources
• Storage resources
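The architecture pairs a Data API with an OAuth2 security layer. As a rough illustration of what an OAuth2-protected call looks like from a client program, the sketch below builds (but does not send) a request carrying a bearer token. The endpoint URL, the `volumeIDs` parameter name, and the `build_volume_request` helper are all hypothetical placeholders; the slides give no request format for the actual Data API.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint: the slides do not give the Data API URL.
DATA_API_URL = "https://data-api.example.org/volumes"


def build_volume_request(volume_ids, access_token):
    """Build, without sending, a GET request that carries an OAuth2 bearer token.

    The query parameter name and the pipe-separated ID list are assumptions
    made for illustration only.
    """
    query = urllib.parse.urlencode({"volumeIDs": "|".join(volume_ids)})
    return urllib.request.Request(
        f"{DATA_API_URL}?{query}",
        headers={"Authorization": f"Bearer {access_token}"},
    )


if __name__ == "__main__":
    req = build_volume_request(["vol.1", "vol.2"], "EXAMPLE_TOKEN")
    print(req.full_url)
    print(req.get_header("Authorization"))
```

Sending the token in the `Authorization` header rather than the URL keeps it out of server access logs, which matters when every request is audited.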
HTRC Storage Layer Ingest
HTRC Shared IU Systems
Data Systems
• Data Capacitor II (Lustre)
• NetApp NFS (18 TB)
– GPFS/Data Direct Networks (28 TB)
Compute Systems
• KARST (high-throughput cluster)
• Big Red 2 (Cray XE6/XK7)
Current Storage (18 TB > 30 TB)
• MARC – 15 GB
• R – 5 TB
• iPython – 5 TB
• Bookworm – 5 TB
• Public Domain BW – 3 TB
• OCR Data - 5 TB
• OCR Index – 2.3 TB
• Audit Logs – 44 GB
• User created MD – 15 GB
• Blacklight – 20 GB
• IU Pairtree – 2.65 TB
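Summing the line items shows how they relate to the 18 TB and 30 TB figures in the slide heading. This is a quick arithmetic sketch, assuming decimal units (1 TB = 1000 GB) and reading the heading as a move from an 18 TB to a 30 TB allocation; that reading is an interpretation, not something the slide states.

```python
GB_PER_TB = 1000  # assumption: decimal units

# Line items from the Current Storage slide, converted to TB.
storage_tb = {
    "MARC": 15 / GB_PER_TB,
    "R": 5.0,
    "iPython": 5.0,
    "Bookworm": 5.0,
    "Public Domain BW": 3.0,
    "OCR Data": 5.0,
    "OCR Index": 2.3,
    "Audit Logs": 44 / GB_PER_TB,
    "User created MD": 15 / GB_PER_TB,
    "Blacklight": 20 / GB_PER_TB,
    "IU Pairtree": 2.65,
}

total_tb = sum(storage_tb.values())

print(f"total: {total_tb:.3f} TB")
print(f"exceeds 18 TB: {total_tb > 18}")
print(f"fits in 30 TB: {total_tb <= 30}")
```

Under these assumptions the items total about 28 TB, which is more than 18 TB and just under 30 TB.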
Want More HTRC?
3rd Annual HTRC UnCamp!
March 30-31, 2015 in Ann Arbor, Michigan
DH 2015
June 29-July 3, 2015 in Sydney, Australia
http://www.hathitrust.org/htrc_uncamp2015
Thank You
• This presentation was made possible with content
provided by many HTRC colleagues: John Unsworth, J.
Stephen Downie, Beth Plale, Beth Namachchivaya, Dirk
Herr-Hoyman, Milinda Pathirage, Samitha Liyanage,
Miao Chen, Guangchen Ruan, Jiaan Zeng, Loretta Auvil,
Boris Capitanu, and many others…
• The HTRC Non-Consumptive Research Grant was
graciously funded by the Alfred P. Sloan Foundation
• The HTRC WCSA grant is graciously funded by the
Andrew W. Mellon Foundation.
• HTRC - http://www.hathitrust.org/htrc
• IU D2I Center - http://d2i.indiana.edu/
• UIUC GSLIS - http://www.lis.illinois.edu/

Academic Libraries and Big Data: Trends in Collection, Publication, Preservat...Academic Libraries and Big Data: Trends in Collection, Publication, Preservat...
Academic Libraries and Big Data: Trends in Collection, Publication, Preservat...
 
TLT Discussion on "Saving My Stuff" - 06.05.15
TLT Discussion on "Saving My Stuff" - 06.05.15TLT Discussion on "Saving My Stuff" - 06.05.15
TLT Discussion on "Saving My Stuff" - 06.05.15
 
The HathiTrust Research Center: An Overview of Advanced Computational Services
The HathiTrust Research Center: An Overview of Advanced Computational ServicesThe HathiTrust Research Center: An Overview of Advanced Computational Services
The HathiTrust Research Center: An Overview of Advanced Computational Services
 
Creating Sustainable Communities in Open Data Resources: The eagle-i and VIVO...
Creating Sustainable Communities in Open Data Resources: The eagle-i and VIVO...Creating Sustainable Communities in Open Data Resources: The eagle-i and VIVO...
Creating Sustainable Communities in Open Data Resources: The eagle-i and VIVO...
 
ER&L 2015 Closing Keynote Slides
ER&L 2015 Closing Keynote SlidesER&L 2015 Closing Keynote Slides
ER&L 2015 Closing Keynote Slides
 
The HathiTrust Research Center: Big Data Analytics in a Secure Data Framework
The HathiTrust Research Center: Big Data Analytics in a Secure Data FrameworkThe HathiTrust Research Center: Big Data Analytics in a Secure Data Framework
The HathiTrust Research Center: Big Data Analytics in a Secure Data Framework
 
Owning the Discovery Experience for Your Patrons
Owning the Discovery Experience for Your PatronsOwning the Discovery Experience for Your Patrons
Owning the Discovery Experience for Your Patrons
 
Kuali OLE: Enabling Choices for Libraries
Kuali OLE: Enabling Choices for LibrariesKuali OLE: Enabling Choices for Libraries
Kuali OLE: Enabling Choices for Libraries
 
Charleston Seminar Being Earnest with our Collections - Legacy to Cloud
Charleston Seminar Being Earnest with our Collections - Legacy to CloudCharleston Seminar Being Earnest with our Collections - Legacy to Cloud
Charleston Seminar Being Earnest with our Collections - Legacy to Cloud
 
The HathiTrust Research Center (HTRC): An Overview and Demo
The HathiTrust Research Center (HTRC): An Overview and DemoThe HathiTrust Research Center (HTRC): An Overview and Demo
The HathiTrust Research Center (HTRC): An Overview and Demo
 
SCONUL Kuali OLE Briefing
SCONUL Kuali OLE BriefingSCONUL Kuali OLE Briefing
SCONUL Kuali OLE Briefing
 
SEAD Datanet and Sustainability Science
SEAD Datanet and Sustainability Science SEAD Datanet and Sustainability Science
SEAD Datanet and Sustainability Science
 
New Perspectives for Business Intelligence: Library and Research Technologies...
New Perspectives for Business Intelligence: Library and Research Technologies...New Perspectives for Business Intelligence: Library and Research Technologies...
New Perspectives for Business Intelligence: Library and Research Technologies...
 
Kuali OLE: Deep Library Collaboration and the Release of a Community-Sourced ...
Kuali OLE: Deep Library Collaboration and the Release of a Community-Sourced ...Kuali OLE: Deep Library Collaboration and the Release of a Community-Sourced ...
Kuali OLE: Deep Library Collaboration and the Release of a Community-Sourced ...
 
GOKb & KB+: An International Partnership to leverage Open Access and Communit...
GOKb & KB+: An International Partnership to leverage Open Access and Communit...GOKb & KB+: An International Partnership to leverage Open Access and Communit...
GOKb & KB+: An International Partnership to leverage Open Access and Communit...
 
Kuali OLE @ LITA Forum 2012
Kuali OLE @ LITA Forum 2012Kuali OLE @ LITA Forum 2012
Kuali OLE @ LITA Forum 2012
 
HathiTrust Research Center: The Fast Version
HathiTrust Research Center: The Fast VersionHathiTrust Research Center: The Fast Version
HathiTrust Research Center: The Fast Version
 
HTRC Architecture Overview
HTRC Architecture OverviewHTRC Architecture Overview
HTRC Architecture Overview
 

Recently uploaded

Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingTeacherCyreneCayanan
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxnegromaestrong
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfAyushMahapatra5
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfChris Hunter
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docxPoojaSen20
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxVishalSingh1417
 

Recently uploaded (20)

Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 

Elephant in the Room: Scaling Storage for the HathiTrust Research Center

  • 1. Elephant in the Room: Scaling Storage for the HathiTrust Research Center Robert H. McDonald | @mcdonald Associate Dean for Library Technologies Deputy Director Data to Insight Center (D2I) Indiana University PASIG 2015 #PASIG2015 UC San Diego March 12, 2015 @hathitrust
  • 2. Mission of the HT Research Center • Research arm of HathiTrust • Established: July, 2011 • Collaborative center: Indiana University & University of Illinois • Mission: Enable researchers world-wide to accomplish tera-scale text data-mining and analysis – Develop cyberinfrastructure to enable HPC access to the HathiTrust Digital Library – Develop cutting-edge software tools for processing, analyzing text – Develop translational tools and data that can be used to enhance HathiTrust Digital Library services to users
  • 4. Working with HTRC Staff Advanced Collaborative Support Scholarly Commons Advanced Research Workshops, tutorials, and guidance for using HTRC One-on-one research support provided through a competitive awards process Collaborative research partnership with HTRC
  • 5. HathiTrust “Wow” Numbers • 13,284,163 total volumes • 6,742,394 book titles • 352,534 serial titles • 4,649,457,050 pages • 595 terabytes • 157 miles • 10,793 tons • 4,979,599 volumes in the public domain
  • 6. Non-Consumptive Research Paradigm • No action or set of actions on part of users, either acting alone or in cooperation with other users over duration of one or multiple sessions can result in sufficient information gathered from collection of copyrighted works to reassemble pages from collection. • Definition disallows collusion between users, or accumulation of material over time. Differentiates human researcher from proxy which is not a user. Users are human beings.
  • 7. DATA LEVELS FOR HTRC
      – Level 0: Derived Factual Data or Supplementary Bibliographic Data; publicly available in US and worldwide (bulk download permitted)
      – Level 1A: Page text or page images that are in public domain in US only and not subject to third-party restrictions; publicly available to everyone in the US (no bulk download)
      – Level 1B: Page text or page images that are in the public domain worldwide and not subject to third-party restrictions; publicly available to everyone worldwide (no bulk download)
      – Level 1C: Primary Bibliographic Metadata; publicly available in US and worldwide (no bulk download)
      – Level 2A: Page text or page images that are in public domain in US only and are subject to third-party restrictions; publicly available to everyone in the US (no download)
      – Level 2B: Page text or page images that are in public domain worldwide and are subject to third-party restrictions; publicly available to everyone worldwide (no download)
      – Level 3: Restricted in-copyright data; may or may not be subject to additional third-party restrictions (no download)
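The data levels on slide 7 can be read as a small lookup table. The sketch below is only an illustration of that reading: the `DataLevel` class, its field names, the `levels_allowing` helper, and the short labels ("bulk", "no bulk", "none") are hypothetical and are not part of any HTRC software. The "restricted" audience for Level 3 is likewise an assumption, since the slide does not state an audience for that level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataLevel:
    """One row of the slide's data-level list (field names are illustrative)."""
    code: str       # level label as written on the slide
    content: str    # what kind of data the level covers
    audience: str   # who may access it, per the slide
    download: str   # "bulk", "no bulk", or "none" (hypothetical labels)


# Values transcribed from slide 7; wording is abbreviated.
LEVELS = [
    DataLevel("0", "derived factual or supplementary bibliographic data",
              "US and worldwide", "bulk"),
    DataLevel("1A", "page text/images, public domain in US only, no third-party restrictions",
              "US", "no bulk"),
    DataLevel("1B", "page text/images, public domain worldwide, no third-party restrictions",
              "worldwide", "no bulk"),
    DataLevel("1C", "primary bibliographic metadata",
              "US and worldwide", "no bulk"),
    DataLevel("2A", "page text/images, public domain in US only, third-party restrictions",
              "US", "none"),
    DataLevel("2B", "page text/images, public domain worldwide, third-party restrictions",
              "worldwide", "none"),
    # The slide gives no audience for Level 3; "restricted" is an assumption.
    DataLevel("3", "restricted in-copyright data",
              "restricted", "none"),
]


def levels_allowing(download: str) -> list[str]:
    """Return the codes of all levels with the given download policy."""
    return [level.code for level in LEVELS if level.download == download]


if __name__ == "__main__":
    print("Bulk download permitted:", levels_allowing("bulk"))
    print("No download at all:", levels_allowing("none"))
```

Grouping by download policy makes the pattern in the list visible: only Level 0 permits bulk download, and every level that is subject to third-party restrictions or copyright permits no download at all.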
  • 8. Working with HTRC Tools Get started at: https://htrc2.pti.indiana.edu/ Build Worksets Execute Algorithms Visualize Term Frequency http://sandbox.htrc.illinois.edu/bookworm/
  • 9. HTRC Architecture (diagram labels): Data API access interface; Portal Access; Direct programmatic access (by programs running on HTRC machines); Security (OAuth2); Audit; Cassandra cluster volume store; Solr index; Algorithms; Result Sets; Meandre Workflows; Registry (WSO2); Compute resources; Storage resources; Agent; Job Submission; Collection building; Collections; Blacklight; Solr Proxy
  • 11. HTRC Shared IU Systems
      Data Systems
      – Data Capacitor II (Lustre)
      – NetApp NFS (18 TB)
          – GPFS/Data Direct Networks (28 TB)
      Compute Systems
      – KARST (high-throughput cluster)
      – Big Red 2 (Cray XE6/XK7)
      Current Storage (18 TB > 30 TB)
      – MARC: 15 GB
      – R: 5 TB
      – iPython: 5 TB
      – Bookworm: 5 TB
      – Public Domain BW: 3 TB
      – OCR Data: 5 TB
      – OCR Index: 2.3 TB
      – Audit Logs: 44 GB
      – User created MD: 15 GB
      – Blacklight: 20 GB
      – IU Pairtree: 2.65 TB
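As a quick check on the "Current Storage" list on slide 11, the individual allocations can be summed. The sketch below assumes decimal units (1 TB = 1000 GB), which the slide does not specify; the dictionary and function names are illustrative only.

```python
# Assumption: decimal units, 1 TB = 1000 GB (the slide does not say).
GB_PER_TB = 1000

# Allocations transcribed from slide 11, all expressed in GB.
ALLOCATIONS_GB = {
    "MARC": 15,
    "R": 5 * GB_PER_TB,
    "iPython": 5 * GB_PER_TB,
    "Bookworm": 5 * GB_PER_TB,
    "Public Domain BW": 3 * GB_PER_TB,
    "OCR Data": 5 * GB_PER_TB,
    "OCR Index": 2.3 * GB_PER_TB,
    "Audit Logs": 44,
    "User created MD": 15,
    "Blacklight": 20,
    "IU Pairtree": 2.65 * GB_PER_TB,
}


def total_tb(allocations_gb: dict[str, float]) -> float:
    """Sum a mapping of allocations given in GB and return the total in TB."""
    return sum(allocations_gb.values()) / GB_PER_TB


if __name__ == "__main__":
    # Print the largest consumers first, then the overall total.
    for name, size_gb in sorted(ALLOCATIONS_GB.items(), key=lambda kv: -kv[1]):
        print(f"{name:18s} {size_gb / GB_PER_TB:7.3f} TB")
    print(f"{'Total':18s} {total_tb(ALLOCATIONS_GB):7.3f} TB")
```

Under that unit assumption the listed items add up to roughly 28 TB, which falls between the 18 TB and 30 TB figures in the slide's heading.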
  • 12. Want More HTRC? 3rd Annual HTRC UnCamp! March 30-31, 2015 in Ann Arbor, Michigan. DH 2015, June 29 to July 3, 2015 in Sydney, Australia. http://www.hathitrust.org/htrc_uncamp2015
  • 13. Thank You • This presentation was made possible with content provided by many HTRC colleagues: John Unsworth, J. Stephen Downie, Beth Plale, Beth Namachchivaya, Dirk Herr-Hoyman, Milinda Pathirage, Samitha Liyanage, Miao Chen, Guangchen Ruan, Jiaan Zeng, Loretta Auvil, Boris Capitanu, and many others… • The HTRC Non-Consumptive Research Grant was graciously funded by the Alfred P. Sloan Foundation. • The HTRC WCSA grant is graciously funded by the Andrew W. Mellon Foundation. • HTRC - http://www.hathitrust.org/htrc • IU D2I Center - http://d2i.indiana.edu/ • UIUC GSLIS - http://www.lis.illinois.edu/

Editor's Notes

  1. 3
  2. Data levels for HTRC:
     Level 0 – Derived Factual Data or Supplementary Bibliographic Data; Publicly available in US and worldwide (bulk download permitted)
     Level 1.A – Page text or Page images that are in public domain in US only and not subject to third-party restrictions; Publicly available to everyone in the US (no bulk download)
     Level 1.B – Page text or Page images that are in the public domain worldwide and not subject to third-party restrictions; Publicly available to everyone worldwide (no bulk download)
     Level 1.C – Primary Bibliographic Metadata; Publicly available in US and worldwide (no bulk download)
     Level 2.A – Page text or Page images that are in public domain in US only and are subject to third-party restrictions; Publicly available to everyone in the US (no download)
     Level 2.B – Page text or Page images that are in public domain worldwide and are subject to third-party restrictions; Publicly available to everyone worldwide (no download)
     Level 3 – Restricted in-copyright data; may or may not be subject to additional third-party restrictions (no download)
  3. Registry – the agent can deploy any service listed in this diagram and can run it with the computational resources. The original plan is to use XSEDE. We are not using this on IIS machine but are using ODIN (128-node cluster; each core has 4 GB memory and 4 computation cores) and smoketree (D2I server; 24 physical cores, 48 logical cores, 128 GB memory). These are not long term; we are just using them for now.