Sven Schlarb
Austrian National Library
Elag 2013
Gent, Belgium, May 29, 2013
An open source infrastructure for preserving
large collections of digital objects
The SCAPE project at the Austrian National Library
• SCAPE project overview
• Application areas at the Austrian National Library
• Web Archiving
• Austrian Books Online
• SCAPE at the Austrian National Library
• Hardware set-up
• Open source software architecture
• Application Scenarios
• Lessons learnt
Overview
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
• Ability to process large
and complex data sets in
preservation scenarios
• Increasing amount of data
in data centers and
memory institutions
Motivation
[Figure: volume, velocity, and variety of data over time; axis labels 1970, 2000, 2030]
cf. Jisc (2012) Activity Data: Delivering benefits from the data deluge.
available at http://www.jisc.ac.uk/publications/reports/2012/activity-data-delivering-benefits.aspx
• “Big data” is a buzzword, just a vague idea
• No definitive GB, TB, PB, (…) threshold where data
becomes “big data”, depends on the institutional
context
• Massive growth of data to be stored and processed
• Situation where solutions that are usually employed do
not fulfill new scalability requirements
• Pushing the limits of conventional database solutions
• Simple batch processing becomes tedious or even impossible
“Big Data” in a library context?
• EU-funded FP7 project,
led by the Austrian Institute
of Technology
• Consortium: 16 Partners
• National Libraries
• Data Centers and Memory
Institutions
• Research institutes and
Universities
• Commercial partners
• Started 2011, runs until
mid/end-2014
SCAPE Consortium
Takeup
•Stakeholders and Communities
•Dissemination
•Training Activities
•Sustainability
SCAPE Project Overview
Platform
•Automation
•Workflows
•Parallelization
•Virtualization
MapReduce/Hadoop in a nutshell
[Figure: Task 1, Task 2, Task 3 → output data → aggregated results]
Experimental Cluster
Job Tracker
Task Trackers
Data Nodes
Name Node
CPU: 1 x 2.53GHz Quadcore CPU (8 HyperThreading cores)
RAM: 16GB
DISK: 2 x 1TB DISKs configured as RAID0 (performance) – 2 TB effective
• Of 8 HT cores per node: 5 for Map; 2 for Reduce; 1 for operating system.
→ 25 processing cores for Map tasks and
→ 10 cores for Reduce tasks
CPU: 2 x 2.40GHz Quadcore CPU (16 HyperThreading cores)
RAM: 24GB
DISK: 3 x 1TB DISKs configured as RAID5 (redundancy) – 2 TB effective
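The Map and Reduce totals follow from the per-node allocation of 5 Map, 2 Reduce, and 1 operating-system core, which together account for 8 hyper-threaded cores. A minimal sketch of that arithmetic, assuming five worker nodes (a number inferred from 25 Map cores at 5 per node, not stated on the slide):

```python
# Per-worker allocation taken from the slide: 5 Map, 2 Reduce, 1 for the OS.
MAP_SLOTS_PER_NODE = 5
REDUCE_SLOTS_PER_NODE = 2
OS_CORES_PER_NODE = 1

# Assumption: five worker nodes, inferred from 25 Map cores / 5 per node.
WORKER_NODES = 5


def cluster_slots(nodes: int) -> tuple[int, int]:
    """Return (map_slots, reduce_slots) available across all worker nodes."""
    return nodes * MAP_SLOTS_PER_NODE, nodes * REDUCE_SLOTS_PER_NODE


map_slots, reduce_slots = cluster_slots(WORKER_NODES)
cores_used_per_node = (
    MAP_SLOTS_PER_NODE + REDUCE_SLOTS_PER_NODE + OS_CORES_PER_NODE
)
print(map_slots, reduce_slots, cores_used_per_node)
```

Changing `WORKER_NODES` shows how the available Map and Reduce capacity would scale if nodes were added.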
• Access via REST API
• Workflow engine for complex
jobs
• Hive as the frontend for
analytic queries
• MapReduce/Pig for
Extract, Transform, and
Load (ETL)
• “Small” objects in HDFS or
HBase
• “Large” digital objects stored
on NetApp Filer
Platform Architecture
Taverna Workflow engine
REST API
• Web Archiving
• Scenario 1: Web Archive Mime Type Identification
• Austrian Books Online
• Scenario 2: Image File Format Migration
• Scenario 3: Comparison of Book Derivatives
• Scenario 4: MapReduce in Digitised Book Quality Assurance
Application scenarios
• Physical storage 19 TB
• Raw data 32 TB
• Number of objects
1,241,650,566
• Domain harvesting
• Entire top-level-domain
.at every 2 years
• Selective harvesting
• Important websites that
change regularly
• Event harvesting
• Special occasions and
events (e.g. elections)
Key Data Web Archiving
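[Figure: a (W)ARC container holding JPG, GIF, HTM, HTM, and MID records is read through a (W)ARC InputFormat and (W)ARC RecordReader, based on the HERITRIX web crawler's read/write (W)ARC code. In the Map step, Apache Tika detects the MIME type of each record (e.g. JPG → image/jpg); the Reduce step counts the records per type:]
image/jpg 1
image/gif 1
text/html 2
audio/midi 1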
Scenario 1: Web Archive Mime Type Identification
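The Map and Reduce steps of this scenario can be sketched in plain Python. This is an illustration only, not the SCAPE implementation: `detect_mime` is a hypothetical stand-in for Apache Tika that checks a few well-known magic numbers, the payloads are toy data, and the standard type name `image/jpeg` is used where the slide writes `image/jpg`.

```python
from collections import Counter
from typing import Iterable, Iterator

# Hypothetical stand-in for Apache Tika: a few well-known magic numbers.
MAGIC = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF8", "image/gif"),
    (b"MThd", "audio/midi"),
]


def detect_mime(payload: bytes) -> str:
    """Guess a MIME type from the leading bytes of a record payload."""
    for prefix, mime in MAGIC:
        if payload.startswith(prefix):
            return mime
    head = payload[:64].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "text/html"
    return "application/octet-stream"


def map_record(payload: bytes) -> Iterator[tuple[str, int]]:
    """Map step: emit (mime type, 1) for one archive record."""
    yield detect_mime(payload), 1


def reduce_counts(pairs: Iterable[tuple[str, int]]) -> dict[str, int]:
    """Reduce step: sum the counts per MIME type."""
    totals = Counter()
    for mime, count in pairs:
        totals[mime] += count
    return dict(totals)


# Five toy payloads mirroring the slide: JPG, GIF, HTM, HTM, MID.
records = [
    b"\xff\xd8\xff\xe0 jpeg data",
    b"GIF89a gif data",
    b"<html><body>one</body></html>",
    b"<!DOCTYPE html><html></html>",
    b"MThd midi data",
]
pairs = [pair for record in records for pair in map_record(record)]
mime_counts = reduce_counts(pairs)
print(mime_counts)
```

On a cluster, the Map calls run in parallel over the records of many (W)ARC files, and only the small (type, count) pairs travel to the Reduce step.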
[Figure: TIKA 1.0 and DROID 6.01]
Scenario 1: Web Archive Mime Type Identification
• Public private partnership with Google
• Only public domain
• Objective to scan ~ 600,000 volumes
• ~ 200 million pages
• ~ 70 project team members
• 20+ in core team
• ~ 130,000 physical volumes scanned so far
• ~ 40 million pages
Key Data Austrian Books Online
ADOCO (Austrian Books Online Download & Control)
PairTree: https://confluence.ucop.edu/display/Curation/PairTree
[Figure: Google, Public Private Partnership, ADOCO]
• Task: Image file format migration
• TIFF to JPEG2000 migration
• Objective: Reduce storage costs by
reducing the size of the images
• JPEG2000 to TIFF migration
• Objective: Mitigation of the JPEG2000 file
format obsolescence risk
• Challenges:
• Integrating validation, migration, and
quality assurance
• Compute-intensive quality
assurance
Scenario 2: Image file format migration
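One way to picture the integration challenge is a per-image pipeline that runs validation, migration, and quality assurance in a fixed order and stops at the first failure. The sketch below is an assumption about how such a chain could be wired, not the SCAPE workflow itself; every step is passed in as a callable, and the dummy steps used here are placeholders for real tools.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MigrationResult:
    source: str
    target: str
    ok: bool
    failed_step: Optional[str] = None


def migrate_with_checks(
    source: str,
    target: str,
    validate_source: Callable[[str], bool],
    migrate: Callable[[str, str], bool],
    validate_target: Callable[[str], bool],
    compare: Callable[[str, str], bool],
) -> MigrationResult:
    """Run validation, migration, and QA in order; stop at the first failure."""
    steps = [
        ("validate_source", lambda: validate_source(source)),
        ("migrate", lambda: migrate(source, target)),
        ("validate_target", lambda: validate_target(target)),
        ("quality_assurance", lambda: compare(source, target)),
    ]
    for name, step in steps:
        if not step():
            return MigrationResult(source, target, ok=False, failed_step=name)
    return MigrationResult(source, target, ok=True)


# Dummy steps so the sketch runs without any imaging tools installed.
common = dict(
    validate_source=lambda path: path.endswith(".tif"),
    migrate=lambda src, dst: True,
    validate_target=lambda path: path.endswith(".jp2"),
)
good = migrate_with_checks("page.tif", "page.jp2", compare=lambda s, d: True, **common)
bad = migrate_with_checks("page.tif", "page.jp2", compare=lambda s, d: False, **common)
print(good.ok, bad.failed_step)
```

Recording the name of the failed step makes it possible to count, across a whole collection, how many images fail at validation versus quality assurance.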
• Task: Compare different versions of the same book
• Images have been manipulated (cropped, rotated) and stored
in different locations
• Images come from different scanning sources or were subject
to different modification procedures
• Challenges:
• Compute-intensive (average runtime per book on a single
quad-core server ~ 4.5 hours)
• 130,000 books, ~320 pages each
• SCAPE tool: Matchbox
Scenario 3: Comparison of book derivatives
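The figures on this slide imply roughly 585,000 server-hours for the whole collection. A short worked calculation, using only the numbers quoted above; the count of 100 servers in the last line is a hypothetical value chosen for illustration:

```python
BOOKS = 130_000        # books in the collection (from the slide)
HOURS_PER_BOOK = 4.5   # average runtime per book on one quad-core server

total_hours = BOOKS * HOURS_PER_BOOK
total_years = total_hours / (24 * 365)


def wall_clock_days(servers: int) -> float:
    """Days needed if the books are spread evenly over identical servers."""
    return total_hours / servers / 24


print(f"{total_hours:,.0f} server-hours, about {total_years:.1f} years on one server")
# Hypothetical cluster size, for illustration only.
print(f"{wall_clock_days(100):.0f} days on 100 such servers")
```

The calculation assumes perfect scaling with no scheduling or I/O overhead, so real wall-clock times would be longer.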
• ETL processing of 60,000 books, ~ 24 million pages
• Using Taverna’s “Tool service” (remote ssh execution)
• Orchestration of different types of Hadoop jobs
• Hadoop-Streaming-API
• Hadoop Map/Reduce
• Hive
• Workflow available on myExperiment:
http://www.myexperiment.org/workflows/3105
• See blog post:
http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
Scenario 4: MapReduce in Quality Assurance
• Create input text files
containing file paths (JP2
& HTML)
• Read image metadata
using Exiftool (Hadoop
Streaming API)
• Create sequence file
containing all HTML files
• Calculate average block
width using MapReduce
• Load data in Hive tables
• Execute SQL test query
Scenario 4: MapReduce in Quality Assurance
find
/NAS/Z119585409/00000001.jp2
/NAS/Z119585409/00000002.jp2
/NAS/Z119585409/00000003.jp2
…
/NAS/Z117655409/00000001.jp2
/NAS/Z117655409/00000002.jp2
/NAS/Z117655409/00000003.jp2
…
/NAS/Z119585987/00000001.jp2
/NAS/Z119585987/00000002.jp2
/NAS/Z119585987/00000003.jp2
…
/NAS/Z119584539/00000001.jp2
/NAS/Z119584539/00000002.jp2
/NAS/Z119584539/00000003.jp2
…
/NAS/Z119599879/00000001.jp2
/NAS/Z119589879/00000002.jp2
/NAS/Z119589879/00000003.jp2
...
...
Jp2PathCreator → HadoopStreamingExiftoolRead (reading files from NAS)
Data size: 1.4 GB → 1.2 GB
60,000 books (24 million pages): ~ 5 h + ~ 38 h = ~ 43 h
Z119585409/00000001 2345
Z119585409/00000002 2340
Z119585409/00000003 2543
…
Z117655409/00000001 2300
Z117655409/00000002 2300
Z117655409/00000003 2345
…
Z119585987/00000001 2300
Z119585987/00000002 2340
Z119585987/00000003 2432
…
Z119584539/00000001 5205
Z119584539/00000002 2310
Z119584539/00000003 2134
…
Z119599879/00000001 2312
Z119589879/00000002 2300
Z119589879/00000003 2300
...
Reading image metadata
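The step from file paths to (page identifier, width) pairs can be sketched as a Hadoop Streaming style mapper in Python. This is an illustration under assumptions, not the SCAPE code: the function names are hypothetical, and `exiftool_width` assumes ExifTool is installed on every worker node. The width reader is injectable, so the demonstration below runs with fake widths and without any image files.

```python
import subprocess
from pathlib import PurePosixPath
from typing import Callable, Iterable, Iterator


def path_to_key(path: str) -> str:
    """Turn /NAS/Z119585409/00000001.jp2 into Z119585409/00000001."""
    p = PurePosixPath(path.strip())
    return f"{p.parent.name}/{p.stem}"


def exiftool_width(path: str) -> int:
    """Read the image width with ExifTool (assumed to be on the PATH)."""
    out = subprocess.run(
        ["exiftool", "-s3", "-ImageWidth", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.strip())


def mapper(
    lines: Iterable[str],
    read_width: Callable[[str], int] = exiftool_width,
) -> Iterator[str]:
    """Emit one tab-separated 'key<TAB>width' line per input path."""
    for line in lines:
        path = line.strip()
        if path:
            yield f"{path_to_key(path)}\t{read_width(path)}"


# Demonstration with a fake reader, so no files or ExifTool are needed.
sample = [
    "/NAS/Z119585409/00000001.jp2\n",
    "/NAS/Z119585409/00000002.jp2\n",
]
fake_widths = {
    "/NAS/Z119585409/00000001.jp2": 2345,
    "/NAS/Z119585409/00000002.jp2": 2340,
}
mapped = list(mapper(sample, read_width=fake_widths.__getitem__))
print(mapped)
```

In a streaming job, the same `mapper` would be fed `sys.stdin` and each yielded line printed to standard output.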
find
/NAS/Z119585409/00000707.html
/NAS/Z119585409/00000708.html
/NAS/Z119585409/00000709.html
…
/NAS/Z138682341/00000707.html
/NAS/Z138682341/00000708.html
/NAS/Z138682341/00000709.html
…
/NAS/Z178791257/00000707.html
/NAS/Z178791257/00000708.html
/NAS/Z178791257/00000709.html
…
/NAS/Z967985409/00000707.html
/NAS/Z967985409/00000708.html
/NAS/Z967985409/00000709.html
…
/NAS/Z196545409/00000707.html
/NAS/Z196545409/00000708.html
/NAS/Z196545409/00000709.html
...
Z119585409/00000707
Z119585409/00000708
Z119585409/00000709
Z119585409/00000710
Z119585409/00000711
Z119585409/00000712
HtmlPathCreator → SequenceFileCreator (reading files from NAS)
Data size: 1.4 GB → 997 GB (uncompressed)
60,000 books (24 million pages): ~ 5 h + ~ 24 h = ~ 29 h
SequenceFile creation
Z119585409/00000001
Z119585409/00000002
Z119585409/00000003
Z119585409/00000004
Z119585409/00000005
...
Z119585409/00000001 2100
Z119585409/00000001 2200
Z119585409/00000001 2300
Z119585409/00000001 2400
Z119585409/00000002 2100
Z119585409/00000002 2200
Z119585409/00000002 2300
Z119585409/00000002 2400
Z119585409/00000003 2100
Z119585409/00000003 2200
Z119585409/00000003 2300
Z119585409/00000003 2400
Z119585409/00000004 2100
Z119585409/00000004 2200
Z119585409/00000004 2300
Z119585409/00000004 2400
Z119585409/00000005 2100
Z119585409/00000005 2200
Z119585409/00000005 2300
Z119585409/00000005 2400
Z119585409/00000001 2250
Z119585409/00000002 2250
Z119585409/00000003 2250
Z119585409/00000004 2250
Z119585409/00000005 2250
HadoopAvBlockWidthMapReduce: SequenceFile → Map → Reduce → text file
Calculate average block width using MapReduce
60,000 books (24 million pages): ~ 6 h
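The Reduce side of this job groups the block widths emitted per page and averages them. A minimal sketch in plain Python using the values shown on the slide; extracting the widths from the HTML files is left out, and the function name is hypothetical:

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable


def reduce_average(pairs: Iterable[tuple[str, int]]) -> dict[str, float]:
    """Group block widths by page identifier and average each group."""
    grouped = defaultdict(list)
    for page_id, width in pairs:
        grouped[page_id].append(width)
    return {page_id: mean(widths) for page_id, widths in grouped.items()}


# Map output as on the slide: four block widths for each of five pages.
map_output = [
    (f"Z119585409/{page:08d}", width)
    for page in range(1, 6)
    for width in (2100, 2200, 2300, 2400)
]
averages = reduce_average(map_output)
for page_id, avg in sorted(averages.items()):
    print(page_id, avg)
```

Each page ends up with an average of 2250, matching the Reduce output on the slide.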
HiveLoadExifData & HiveLoadHocrData
jp2width
jid jwidth
Z119585409/00000001 2250
Z119585409/00000002 2150
Z119585409/00000003 2125
Z119585409/00000004 2125
Z119585409/00000005 2250
htmlwidth
hid hwidth
Z119585409/00000001 1870
Z119585409/00000002 2100
Z119585409/00000003 2015
Z119585409/00000004 1350
Z119585409/00000005 1700
CREATE TABLE jp2width
(jid STRING, jwidth INT)
CREATE TABLE htmlwidth
(hid STRING, hwidth INT)
Analytic Queries
HiveSelect
jp2width
jid jwidth
Z119585409/00000001 2250
Z119585409/00000002 2150
Z119585409/00000003 2125
Z119585409/00000004 2125
Z119585409/00000005 2250
htmlwidth
hid hwidth
Z119585409/00000001 1870
Z119585409/00000002 2100
Z119585409/00000003 2015
Z119585409/00000004 1350
Z119585409/00000005 1700
jid jwidth hwidth
Z119585409/00000001 2250 1870
Z119585409/00000002 2150 2100
Z119585409/00000003 2125 2015
Z119585409/00000004 2125 1350
Z119585409/00000005 2250 1700
select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid
Analytic Queries
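The join can be tried out locally without a Hadoop cluster. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Hive (an assumption made for illustration; Hive types such as STRING and INT are mapped to SQLite's TEXT and INTEGER) and loads the five sample rows from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jp2width (jid TEXT, jwidth INTEGER)")
conn.execute("CREATE TABLE htmlwidth (hid TEXT, hwidth INTEGER)")

# Sample rows copied from the slide.
jp2_rows = [
    ("Z119585409/00000001", 2250),
    ("Z119585409/00000002", 2150),
    ("Z119585409/00000003", 2125),
    ("Z119585409/00000004", 2125),
    ("Z119585409/00000005", 2250),
]
html_rows = [
    ("Z119585409/00000001", 1870),
    ("Z119585409/00000002", 2100),
    ("Z119585409/00000003", 2015),
    ("Z119585409/00000004", 1350),
    ("Z119585409/00000005", 1700),
]
conn.executemany("INSERT INTO jp2width VALUES (?, ?)", jp2_rows)
conn.executemany("INSERT INTO htmlwidth VALUES (?, ?)", html_rows)

# Same join as on the slide; ORDER BY is added only for stable output.
joined = conn.execute(
    "SELECT jid, jwidth, hwidth "
    "FROM jp2width INNER JOIN htmlwidth ON jid = hid "
    "ORDER BY jid"
).fetchall()
for row in joined:
    print(row)
conn.close()
```

With both widths side by side per page, a follow-up query could flag pages where the two values differ strongly.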
• Emergence of new options for creating large-scale
storage and processing infrastructures
• HDFS as storage master or staging area?
• Create a local cluster or rent a cloud infrastructure?
• Apache Hadoop offers a stable core for building a large-scale
processing platform that is ready to be used in
production
• Important to carefully select additional components
from the Apache Hadoop ecosystem (HBase, Hive, Pig,
Oozie, Yarn, Ambari, etc.) that fit your needs
Lessons learnt
Further information
•Project website: www.scape-project.eu
•GitHub repository: www.github.com/openplanets
•Project Wiki: www.wiki.opf-labs.org/display/SP/Home
SCAPE tools mentioned
•SCAPE Platform
• http://www.scape-project.eu/publication/an-architectural-overview-
of-the-scape-preservation-platform
•Jpylyzer – JPEG2000 validation
• http://www.openplanetsfoundation.org/software/jpylyzer
•Matchbox – Image comparison
• https://github.com/openplanets/scape/tree/master/pc-qa-matchbox
Thank you! Questions?

Más contenido relacionado

La actualidad más candente

ESCAPE Kick-off meeting - WP5 (Feb 2019)
ESCAPE Kick-off meeting - WP5 (Feb 2019)ESCAPE Kick-off meeting - WP5 (Feb 2019)
ESCAPE Kick-off meeting - WP5 (Feb 2019)ESCAPE EU
 
EOSC-hub Early Adopter Programme
EOSC-hub Early Adopter ProgrammeEOSC-hub Early Adopter Programme
EOSC-hub Early Adopter ProgrammeEOSC-hub project
 
OpenAIRE for SPARC and SPARC Europe
OpenAIRE for SPARC and SPARC EuropeOpenAIRE for SPARC and SPARC Europe
OpenAIRE for SPARC and SPARC EuropeOpenAIRE
 
Software for data management and exploitation
Software for data management and exploitationSoftware for data management and exploitation
Software for data management and exploitationEOSC-hub project
 
Experience in managing service portfolio by Pasquale Pagano
Experience in managing service portfolio by Pasquale PaganoExperience in managing service portfolio by Pasquale Pagano
Experience in managing service portfolio by Pasquale PaganoBlue BRIDGE
 
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Project
 
ENP Belgrade WS refinement introduction
ENP Belgrade WS refinement introductionENP Belgrade WS refinement introduction
ENP Belgrade WS refinement introductionEuropeana Newspapers
 
Prototype Phase Kick-off Event and Ceremony
Prototype Phase Kick-off Event and CeremonyPrototype Phase Kick-off Event and Ceremony
Prototype Phase Kick-off Event and CeremonyArchiver
 
Europeana Newspapers - the Gateway to European Newspapers Online
Europeana Newspapers - the Gateway to European Newspapers OnlineEuropeana Newspapers - the Gateway to European Newspapers Online
Europeana Newspapers - the Gateway to European Newspapers Onlinecneudecker
 
The European life-science data infrastructure: Data, Computing and Services ...
The European life-science data infrastructure: Data, Computing and Services ...The European life-science data infrastructure: Data, Computing and Services ...
The European life-science data infrastructure: Data, Computing and Services ...Rafael C. Jimenez
 
Design phase kick-off event and Ceremony
Design phase kick-off event and CeremonyDesign phase kick-off event and Ceremony
Design phase kick-off event and CeremonyArchiver
 
INSPIRE data scope
INSPIRE data scopeINSPIRE data scope
INSPIRE data scopeinspireeu
 
Cross e-Infrastructure collaborations
Cross e-Infrastructure collaborationsCross e-Infrastructure collaborations
Cross e-Infrastructure collaborationsEUDAT
 
Latif ladid gen6 overview
Latif ladid gen6 overviewLatif ladid gen6 overview
Latif ladid gen6 overviewGlobalForum
 
Science Demonstrator Session: Social and Earth Sciences
Science Demonstrator Session: Social and Earth SciencesScience Demonstrator Session: Social and Earth Sciences
Science Demonstrator Session: Social and Earth SciencesEOSCpilot .eu
 
OpenAIRE "How to make your repository OpenAIRE compliant: EPrints"
OpenAIRE "How to make your repository OpenAIRE compliant: EPrints"OpenAIRE "How to make your repository OpenAIRE compliant: EPrints"
OpenAIRE "How to make your repository OpenAIRE compliant: EPrints"OpenAIRE
 
1st Technical Meeting - WP8
1st Technical Meeting - WP81st Technical Meeting - WP8
1st Technical Meeting - WP8SLOPE Project
 
Linked Data with hybrid services in Agriculture
Linked Data with hybrid services in AgricultureLinked Data with hybrid services in Agriculture
Linked Data with hybrid services in AgricultureRaul Palma
 

La actualidad más candente (20)

ESCAPE Kick-off meeting - WP5 (Feb 2019)
ESCAPE Kick-off meeting - WP5 (Feb 2019)ESCAPE Kick-off meeting - WP5 (Feb 2019)
ESCAPE Kick-off meeting - WP5 (Feb 2019)
 
Metadata
MetadataMetadata
Metadata
 
Refinement
RefinementRefinement
Refinement
 
EOSC-hub Early Adopter Programme
EOSC-hub Early Adopter ProgrammeEOSC-hub Early Adopter Programme
EOSC-hub Early Adopter Programme
 
OpenAIRE for SPARC and SPARC Europe
OpenAIRE for SPARC and SPARC EuropeOpenAIRE for SPARC and SPARC Europe
OpenAIRE for SPARC and SPARC Europe
 
Software for data management and exploitation
Software for data management and exploitationSoftware for data management and exploitation
Software for data management and exploitation
 
Experience in managing service portfolio by Pasquale Pagano
Experience in managing service portfolio by Pasquale PaganoExperience in managing service portfolio by Pasquale Pagano
Experience in managing service portfolio by Pasquale Pagano
 
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
 
ENP Belgrade WS refinement introduction
ENP Belgrade WS refinement introductionENP Belgrade WS refinement introduction
ENP Belgrade WS refinement introduction
 
Prototype Phase Kick-off Event and Ceremony
Prototype Phase Kick-off Event and CeremonyPrototype Phase Kick-off Event and Ceremony
Prototype Phase Kick-off Event and Ceremony
 
Europeana Newspapers - the Gateway to European Newspapers Online
Europeana Newspapers - the Gateway to European Newspapers OnlineEuropeana Newspapers - the Gateway to European Newspapers Online
Europeana Newspapers - the Gateway to European Newspapers Online
 
The European life-science data infrastructure: Data, Computing and Services ...
The European life-science data infrastructure: Data, Computing and Services ...The European life-science data infrastructure: Data, Computing and Services ...
The European life-science data infrastructure: Data, Computing and Services ...
 
Design phase kick-off event and Ceremony
Design phase kick-off event and CeremonyDesign phase kick-off event and Ceremony
Design phase kick-off event and Ceremony
 
INSPIRE data scope
INSPIRE data scopeINSPIRE data scope
INSPIRE data scope
 
Cross e-Infrastructure collaborations
Cross e-Infrastructure collaborationsCross e-Infrastructure collaborations
Cross e-Infrastructure collaborations
 
Latif ladid gen6 overview
Latif ladid gen6 overviewLatif ladid gen6 overview
Latif ladid gen6 overview
 
Science Demonstrator Session: Social and Earth Sciences
Science Demonstrator Session: Social and Earth SciencesScience Demonstrator Session: Social and Earth Sciences
Science Demonstrator Session: Social and Earth Sciences
 
OpenAIRE "How to make your repository OpenAIRE compliant: EPrints"
OpenAIRE "How to make your repository OpenAIRE compliant: EPrints"OpenAIRE "How to make your repository OpenAIRE compliant: EPrints"
OpenAIRE "How to make your repository OpenAIRE compliant: EPrints"
 
1st Technical Meeting - WP8
1st Technical Meeting - WP81st Technical Meeting - WP8
1st Technical Meeting - WP8
 
Linked Data with hybrid services in Agriculture
Linked Data with hybrid services in AgricultureLinked Data with hybrid services in Agriculture
Linked Data with hybrid services in Agriculture
 

Destacado

Commission on Wartime Contracting in Iraq and Afghanistan
Commission on Wartime Contracting in Iraq and AfghanistanCommission on Wartime Contracting in Iraq and Afghanistan
Commission on Wartime Contracting in Iraq and Afghanistanjddurso
 
Даша відвідує музей
Даша відвідує музейДаша відвідує музей
Даша відвідує музейIrina Genih
 
Proposta alle aziende 2012 2013 nov12
Proposta alle aziende 2012 2013 nov12Proposta alle aziende 2012 2013 nov12
Proposta alle aziende 2012 2013 nov12Bruno Mazzoleni
 
Boosting your child’s self esteem
Boosting your child’s self esteemBoosting your child’s self esteem
Boosting your child’s self esteembookiegold4u
 
Gestión tactica de fallas criticas con análisis probabilístico
Gestión tactica de fallas criticas con análisis probabilísticoGestión tactica de fallas criticas con análisis probabilístico
Gestión tactica de fallas criticas con análisis probabilísticoAdolfo Hitler Huaman Diaz
 
access-control-week-2
access-control-week-2access-control-week-2
access-control-week-2jemtallon
 
Hemophilia
HemophiliaHemophilia
Hemophiliagisellg
 
Application scenarios of the SCAPE project at the Austrian National Library
Application scenarios of the SCAPE project at the Austrian National LibraryApplication scenarios of the SCAPE project at the Austrian National Library
Application scenarios of the SCAPE project at the Austrian National LibrarySven Schlarb
 
quy_trinh_ky_thuat_cay_cao_su_2_5556
quy_trinh_ky_thuat_cay_cao_su_2_5556quy_trinh_ky_thuat_cay_cao_su_2_5556
quy_trinh_ky_thuat_cay_cao_su_2_5556ma ga ka lom
 
2012 october-1-boe-child-abuse
2012 october-1-boe-child-abuse2012 october-1-boe-child-abuse
2012 october-1-boe-child-abuseLadystellas
 
eBook Unified Communications
eBook Unified Communications eBook Unified Communications
eBook Unified Communications Khiara McMillin
 
Моя перша презентація - 5 клас
Моя перша презентація - 5 класМоя перша презентація - 5 клас
Моя перша презентація - 5 класIrina Genih
 
Capitulo ix.relación de la mente y el cerebro
Capitulo ix.relación de la mente y el cerebroCapitulo ix.relación de la mente y el cerebro
Capitulo ix.relación de la mente y el cerebroFrancisco Xavier
 

Destacado (19)

Mi tierra
Mi tierraMi tierra
Mi tierra
 
Commission on Wartime Contracting in Iraq and Afghanistan
Commission on Wartime Contracting in Iraq and AfghanistanCommission on Wartime Contracting in Iraq and Afghanistan
Commission on Wartime Contracting in Iraq and Afghanistan
 
Даша відвідує музей
Даша відвідує музейДаша відвідує музей
Даша відвідує музей
 
Memories photos
Memories photosMemories photos
Memories photos
 
Proposta alle aziende 2012 2013 nov12
Proposta alle aziende 2012 2013 nov12Proposta alle aziende 2012 2013 nov12
Proposta alle aziende 2012 2013 nov12
 
Boosting your child’s self esteem
Boosting your child’s self esteemBoosting your child’s self esteem
Boosting your child’s self esteem
 
Gestión tactica de fallas criticas con análisis probabilístico
Gestión tactica de fallas criticas con análisis probabilísticoGestión tactica de fallas criticas con análisis probabilístico
Gestión tactica de fallas criticas con análisis probabilístico
 
10 21 mz-vp
10 21 mz-vp10 21 mz-vp
10 21 mz-vp
 
London winter2013 peopleto_peopleambassadorprogram
London winter2013 peopleto_peopleambassadorprogramLondon winter2013 peopleto_peopleambassadorprogram
London winter2013 peopleto_peopleambassadorprogram
 
access-control-week-2
access-control-week-2access-control-week-2
access-control-week-2
 
Hemophilia
HemophiliaHemophilia
Hemophilia
 
Application scenarios of the SCAPE project at the Austrian National Library
Application scenarios of the SCAPE project at the Austrian National LibraryApplication scenarios of the SCAPE project at the Austrian National Library
Application scenarios of the SCAPE project at the Austrian National Library
 
quy_trinh_ky_thuat_cay_cao_su_2_5556
quy_trinh_ky_thuat_cay_cao_su_2_5556quy_trinh_ky_thuat_cay_cao_su_2_5556
quy_trinh_ky_thuat_cay_cao_su_2_5556
 
2012 october-1-boe-child-abuse
2012 october-1-boe-child-abuse2012 october-1-boe-child-abuse
2012 october-1-boe-child-abuse
 
3d manuu1
3d manuu13d manuu1
3d manuu1
 
Discover La Cianella
Discover La CianellaDiscover La Cianella
Discover La Cianella
 
eBook Unified Communications
eBook Unified Communications eBook Unified Communications
eBook Unified Communications
 
Моя перша презентація - 5 клас
Моя перша презентація - 5 класМоя перша презентація - 5 клас
Моя перша презентація - 5 клас
 
Capitulo ix.relación de la mente y el cerebro
Capitulo ix.relación de la mente y el cerebroCapitulo ix.relación de la mente y el cerebro
Capitulo ix.relación de la mente y el cerebro
 

Similar a SCAPE Presentation at the Elag2013 conference in Gent/Belgium

LIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbLIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbSCAPE Project
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation WorkflowsSCAPE Project
 
SCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Project
 
The Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating HadoopThe Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating Hadoopcneudecker
 
SCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation EnvironmentsSCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation EnvironmentsSCAPE Project
 
The Use of Big Data Techniques for Digital Archiving
The Use of Big Data Techniques for Digital ArchivingThe Use of Big Data Techniques for Digital Archiving
The Use of Big Data Techniques for Digital ArchivingSven Schlarb
 
GEO Analytics Canada Overview April 2020
GEO Analytics Canada Overview April 2020GEO Analytics Canada Overview April 2020
GEO Analytics Canada Overview April 2020GEO Analytics Canada
 
OGC Interfaces in Thematic Exploitation Platforms
OGC Interfaces in Thematic Exploitation PlatformsOGC Interfaces in Thematic Exploitation Platforms
OGC Interfaces in Thematic Exploitation Platformsterradue
 
Geo Analytics Canada Overview - May 2020
Geo Analytics Canada Overview - May 2020Geo Analytics Canada Overview - May 2020
Geo Analytics Canada Overview - May 2020GEO Analytics Canada
 
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Project
 
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara AubryArchiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara AubryBiblioteca Nacional de España
 
Europeana Newspapers - Data, Tools & Future Plans
 Europeana Newspapers - Data, Tools & Future Plans  Europeana Newspapers - Data, Tools & Future Plans
Europeana Newspapers - Data, Tools & Future Plans cneudecker
 
Hopsworks - ExtremeEarth Open Workshop
Hopsworks - ExtremeEarth Open WorkshopHopsworks - ExtremeEarth Open Workshop
Hopsworks - ExtremeEarth Open WorkshopExtremeEarth
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopAmir Shaikh
 
Cloud Computing Needs for Earth Observation Data Analysis: EGI and EOSC-hub
Cloud Computing Needs for Earth Observation Data Analysis: EGI and EOSC-hubCloud Computing Needs for Earth Observation Data Analysis: EGI and EOSC-hub
Cloud Computing Needs for Earth Observation Data Analysis: EGI and EOSC-hubBjörn Backeberg
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RRadek Maciaszek
 
SCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Project
 

Similar a SCAPE Presentation at the Elag2013 conference in Gent/Belgium (20)

LIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbLIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven Schlarb
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation Workflows
 
SCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with Hadoop
 
The Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating HadoopThe Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating Hadoop
 
SCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation EnvironmentsSCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation Environments
 
The Use of Big Data Techniques for Digital Archiving
The Use of Big Data Techniques for Digital ArchivingThe Use of Big Data Techniques for Digital Archiving
The Use of Big Data Techniques for Digital Archiving
 
GEO Analytics Canada Overview April 2020
GEO Analytics Canada Overview April 2020GEO Analytics Canada Overview April 2020
GEO Analytics Canada Overview April 2020
 
OGC Interfaces in Thematic Exploitation Platforms
OGC Interfaces in Thematic Exploitation PlatformsOGC Interfaces in Thematic Exploitation Platforms
OGC Interfaces in Thematic Exploitation Platforms
 
Geo Analytics Canada Overview - May 2020
Geo Analytics Canada Overview - May 2020Geo Analytics Canada Overview - May 2020
Geo Analytics Canada Overview - May 2020
 
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
 
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara AubryArchiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
 
Europeana Newspapers - Data, Tools & Future Plans
 Europeana Newspapers - Data, Tools & Future Plans  Europeana Newspapers - Data, Tools & Future Plans
Europeana Newspapers - Data, Tools & Future Plans
 
Hopsworks - ExtremeEarth Open Workshop
Hopsworks - ExtremeEarth Open WorkshopHopsworks - ExtremeEarth Open Workshop
Hopsworks - ExtremeEarth Open Workshop
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
Cloud Computing Needs for Earth Observation Data Analysis: EGI and EOSC-hub
Cloud Computing Needs for Earth Observation Data Analysis: EGI and EOSC-hubCloud Computing Needs for Earth Observation Data Analysis: EGI and EOSC-hub
Cloud Computing Needs for Earth Observation Data Analysis: EGI and EOSC-hub
 
Benchmarking of distributed linked data streaming systems
Benchmarking of distributed linked data streaming systemsBenchmarking of distributed linked data streaming systems
Benchmarking of distributed linked data streaming systems
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
 
SCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with Nanite
 
Introduction to GIS
Introduction to GISIntroduction to GIS
Introduction to GIS
 



SCAPE Presentation at the Elag2013 conference in Gent/Belgium

  • 1. An open source infrastructure for preserving large collections of digital objects: The SCAPE project at the Austrian National Library
      • Sven Schlarb, Austrian National Library
      • Elag 2013, Gent, Belgium, May 29, 2013
  • 2. Overview
      • SCAPE project overview
      • Application areas at the Austrian National Library
          • Web Archiving
          • Austrian Books Online
      • SCAPE at the Austrian National Library
          • Hardware set-up
          • Open source software architecture
      • Application Scenarios
      • Lessons learnt
      • This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
  • 3. Motivation
      • Ability to process large and complex data sets in preservation scenarios
      • Increasing amount of data in data centers and memory institutions
      • [Figure: volume, velocity, and variety of data; time axis marked 1970, 2000, 2030]
      • cf. Jisc (2012), Activity Data: Delivering benefits from the data deluge, available at http://www.jisc.ac.uk/publications/reports/2012/activity-data-delivering-benefits.aspx
  • 4. "Big Data" in a library context?
      • "Big data" is a buzzword, just a vague idea
      • No definitive GB, TB, PB, (…) threshold where data becomes "big data"; it depends on the institutional context
      • Massive growth of data to be stored and processed
      • Situation where the solutions usually employed no longer fulfill new scalability requirements
          • Pushing the limits of conventional database solutions
          • Simple batch processing becomes tedious or even impossible
  • 5. SCAPE Consortium
      • EU-funded FP7 project, led by the Austrian Institute of Technology
      • Consortium: 16 partners
          • National libraries
          • Data centers and memory institutions
          • Research institutes and universities
          • Commercial partners
      • Started 2011, runs until mid/end-2014
  • 6. SCAPE Project Overview
      • Takeup
          • Stakeholders and Communities
          • Dissemination
          • Training Activities
          • Sustainability
      • Platform
          • Automation
          • Workflows
          • Parallelization
          • Virtualization
  • 7. MapReduce/Hadoop in a nutshell
      • [Diagram labels: Task 1, Task 2, Task 3; Output data; Aggregated Result]
  • 8. Experimental Cluster
      • Worker nodes (Task Trackers, Data Nodes)
          • CPU: 1 x 2.53GHz Quadcore CPU (8 HyperThreading cores)
          • RAM: 16GB
          • DISK: 2 x 1TB DISKs configured as RAID0 (performance), 2 TB effective
          • Of the 8 HT cores per node: 5 for Map; 2 for Reduce; 1 for operating system
          • In total, 25 processing cores for Map tasks and 10 cores for Reduce tasks
      • Controller node (Job Tracker, Name Node)
          • CPU: 2 x 2.40GHz Quadcore CPU (16 HyperThreading cores)
          • RAM: 24GB
          • DISK: 3 x 1TB DISKs configured as RAID5 (redundancy), 2 TB effective
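The slot figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes five worker nodes with 8 HyperThreading cores each; the node count is not stated on the slide and is inferred here from 25 Map slots at 5 per node.

```python
# Assumption: 5 worker nodes (inferred from 25 Map slots / 5 per node).
WORKER_NODES = 5
HT_CORES_PER_WORKER = 8   # 1 x quad-core CPU with HyperThreading
MAP_SLOTS_PER_WORKER = 5
REDUCE_SLOTS_PER_WORKER = 2
OS_CORES_PER_WORKER = 1

# The per-node allocation should use exactly the cores a worker has.
allocated = MAP_SLOTS_PER_WORKER + REDUCE_SLOTS_PER_WORKER + OS_CORES_PER_WORKER
assert allocated == HT_CORES_PER_WORKER

map_slots = WORKER_NODES * MAP_SLOTS_PER_WORKER
reduce_slots = WORKER_NODES * REDUCE_SLOTS_PER_WORKER
print(f"Map slots: {map_slots}, Reduce slots: {reduce_slots}")
```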
  • 9. Platform Architecture
      • Access via REST API
      • Workflow engine (Taverna) for complex jobs
      • Hive as the frontend for analytic queries
      • MapReduce/Pig for Extract, Transform, and Load (ETL)
      • "Small" objects in HDFS or HBase
      • "Large" digital objects stored on NetApp Filer
  • 10. Application scenarios
      • Web Archiving
          • Scenario 1: Web Archive Mime Type Identification
      • Austrian Books Online
          • Scenario 2: Image File Format Migration
          • Scenario 3: Comparison of Book Derivatives
          • Scenario 4: MapReduce in Digitised Book Quality Assurance
  • 11. Key Data Web Archiving
      • Physical storage: 19 TB
      • Raw data: 32 TB
      • Number of objects: 1,241,650,566
      • Domain harvesting
          • Entire top-level domain .at every 2 years
      • Selective harvesting
          • Important websites that change regularly
      • Event harvesting
          • Special occasions and events (e.g. elections)
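  • 12. Scenario 1: Web Archive Mime Type Identification
      • Input: (W)ARC container holding records such as JPG, GIF, HTM, HTM, MID
      • (W)ARC InputFormat and (W)ARC RecordReader, based on the (W)ARC read/write code of the HERITRIX web crawler, feed the records to MapReduce
      • Map: Apache Tika detects the MIME type of each record (e.g. JPG gives image/jpg)
      • Reduce: count per MIME type: image/jpg 1, image/gif 1, text/html 2, audio/midi 1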
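The map and reduce steps on this slide can be sketched in plain Python. This is only an illustration of the counting pattern: the small magic-byte table in `detect_mime` is a hypothetical stand-in for Apache Tika, the records are made-up byte strings rather than real (W)ARC payloads, and the MIME names follow the registered form (image/jpeg) rather than the slide's shorthand.

```python
from collections import defaultdict

# Hypothetical stand-in for Apache Tika: a tiny magic-byte lookup.
MAGIC = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF8", "image/gif"),
    (b"MThd", "audio/midi"),
]

def detect_mime(payload: bytes) -> str:
    """Guess a MIME type from the first bytes of a record payload."""
    for magic, mime in MAGIC:
        if payload.startswith(magic):
            return mime
    head = payload.lstrip().lower()
    if head.startswith((b"<html", b"<!doctype html")):
        return "text/html"
    return "application/octet-stream"

def map_record(payload: bytes):
    """Map step: emit (mime_type, 1) for one archive record."""
    yield detect_mime(payload), 1

def reduce_counts(pairs):
    """Reduce step: sum the counts per MIME type."""
    totals = defaultdict(int)
    for mime, count in pairs:
        totals[mime] += count
    return dict(totals)

# Made-up records mirroring the slide: JPG, GIF, HTM, HTM, MID.
records = [
    b"\xff\xd8\xff\xe0 jpeg data",
    b"GIF89a gif data",
    b"<html><body>a</body></html>",
    b"<HTML><BODY>b</BODY></HTML>",
    b"MThd midi data",
]
pairs = [pair for record in records for pair in map_record(record)]
counts = reduce_counts(pairs)
for mime, count in sorted(counts.items()):
    print(mime, count)
```

In a real Hadoop job the framework would perform the grouping between the two steps; here a single in-memory dictionary plays that role.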
  • 13. Scenario 1: Web Archive Mime Type Identification
      • [Charts: identification results for TIKA 1.0 and DROID 6.01]
  • 14. Key Data Austrian Books Online
      • Public private partnership with Google
      • Only public domain
      • Objective to scan ~ 600,000 volumes
          • ~ 200 million pages
      • ~ 70 project team members
          • 20+ in core team
      • ~ 130K physical volumes scanned so far
          • ~ 40 million pages
  • 15. ADOCO (Austrian Books Online Download & Control)
      • [Diagram labels: Google, Public Private Partnership, ADOCO]
      • PairTree: https://confluence.ucop.edu/display/Curation/PairTree
  • 16. Scenario 2: Image file format migration
      • Task: Image file format migration
          • TIFF to JPEG2000 migration
              • Objective: Reduce storage costs by reducing the size of the images
          • JPEG2000 to TIFF migration
              • Objective: Mitigation of the JPEG2000 file format obsolescence risk
      • Challenges:
          • Integrating validation, migration, and quality assurance
          • Computing-intensive quality assurance
  • 17. Scenario 3: Comparison of book derivatives
      • Task: Compare different versions of the same book
          • Images have been manipulated (cropped, rotated) and stored in different locations
          • Images come from different scanning sources or were subject to different modification procedures
      • Challenges:
          • Computing intensive (average runtime per book on a single quad-core server ~ 4.5 hours)
          • 130,000 books, ~ 320 pages each
      • SCAPE tool: Matchbox
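A quick back-of-the-envelope calculation shows why this scenario does not fit on one machine. The sketch assumes the ~4.5 hour average holds for all 130,000 books and that books are processed strictly one after another on a single quad-core server.

```python
# Figures from the slide; sequential single-server processing is an assumption.
books = 130_000
hours_per_book = 4.5

total_hours = books * hours_per_book
total_days = total_hours / 24
total_years = total_hours / (24 * 365)

print(f"{total_hours:,.0f} hours = {total_days:,.0f} days = {total_years:.1f} years")
```

That is 585,000 server hours, or roughly 67 years of sequential processing, which is the motivation for distributing the comparison across a cluster.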
  • 18. Scenario 4: MapReduce in Quality Assurance
      • ETL processing of 60,000 books, ~ 24 million pages
      • Using Taverna's "Tool service" (remote ssh execution)
      • Orchestration of different types of Hadoop jobs
          • Hadoop-Streaming-API
          • Hadoop Map/Reduce
          • Hive
      • Workflow available on myExperiment: http://www.myexperiment.org/workflows/3105
      • See blog post: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
  • 19. Scenario 4: MapReduce in Quality Assurance
      • Create input text files containing file paths (JP2 & HTML)
      • Read image metadata using Exiftool (Hadoop Streaming API)
      • Create sequence file containing all HTML files
      • Calculate average block width using MapReduce
      • Load data in Hive tables
      • Execute SQL test query
  • 20. Reading image metadata
      • Jp2PathCreator: find on the NAS produces a text file of JP2 file paths (1.4 GB):
          /NAS/Z119585409/00000001.jp2 /NAS/Z119585409/00000002.jp2 /NAS/Z119585409/00000003.jp2 …
          /NAS/Z117655409/00000001.jp2 /NAS/Z117655409/00000002.jp2 /NAS/Z117655409/00000003.jp2 …
          /NAS/Z119585987/00000001.jp2 /NAS/Z119585987/00000002.jp2 /NAS/Z119585987/00000003.jp2 …
          /NAS/Z119584539/00000001.jp2 /NAS/Z119584539/00000002.jp2 /NAS/Z119584539/00000003.jp2 …
          /NAS/Z119599879/00000001.jp2 /NAS/Z119589879/00000002.jp2 /NAS/Z119589879/00000003.jp2 …
      • HadoopStreamingExiftoolRead: reads the files from the NAS and writes page identifier and image width (1.2 GB):
          Z119585409/00000001 2345, Z119585409/00000002 2340, Z119585409/00000003 2543 …
          Z117655409/00000001 2300, Z117655409/00000002 2300, Z117655409/00000003 2345 …
          Z119585987/00000001 2300, Z119585987/00000002 2340, Z119585987/00000003 2432 …
          Z119584539/00000001 5205, Z119584539/00000002 2310, Z119584539/00000003 2134 …
          Z119599879/00000001 2312, Z119589879/00000002 2300, Z119589879/00000003 2300 …
      • 60,000 books (24 million pages): ~ 5 h + ~ 38 h = ~ 43 h
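A Hadoop Streaming mapper for this step could look roughly like the sketch below. It is not the project's actual script: the function names are hypothetical, and it assumes that `exiftool` is installed on every worker node and that the NAS paths are readable from there. The width reader is passed in as a parameter so the key handling can be exercised without Exiftool.

```python
import subprocess
from pathlib import PurePosixPath

def page_key(path: str) -> str:
    """Turn /NAS/Z119585409/00000001.jp2 into Z119585409/00000001."""
    p = PurePosixPath(path.strip())
    return f"{p.parent.name}/{p.stem}"

def exiftool_width(path: str) -> str:
    """Read the image width via the exiftool CLI (-s3 prints the bare value)."""
    result = subprocess.run(
        ["exiftool", "-s3", "-ImageWidth", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def map_lines(lines, read_width=exiftool_width):
    """Mapper: one tab-separated 'key<TAB>width' record per input path."""
    for line in lines:
        path = line.strip()
        if path:  # skip blank lines in the path list
            yield f"{page_key(path)}\t{read_width(path)}"

# Example with a fake width reader instead of a real Exiftool call.
sample = ["/NAS/Z119585409/00000001.jp2\n", "\n", "/NAS/Z119585409/00000002.jp2\n"]
fake_widths = {"/NAS/Z119585409/00000001.jp2": "2345",
               "/NAS/Z119585409/00000002.jp2": "2340"}
records = list(map_lines(sample, read_width=fake_widths.get))
for record in records:
    print(record)
```

Run as a streaming mapper, the script would pass `sys.stdin` to `map_lines` and print each record to standard output.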
  • 22. Calculate average block width using MapReduce
      • HadoopAvBlockWidthMapReduce: reads a SequenceFile, writes a text file
      • Input keys: Z119585409/00000001, Z119585409/00000002, Z119585409/00000003, Z119585409/00000004, Z119585409/00000005, ...
      • Map output (page identifier, block width); each of the five pages emits four values:
          Z119585409/00000001 2100, Z119585409/00000001 2200, Z119585409/00000001 2300, Z119585409/00000001 2400
          Z119585409/00000002 2100, Z119585409/00000002 2200, Z119585409/00000002 2300, Z119585409/00000002 2400
          Z119585409/00000003 2100, Z119585409/00000003 2200, Z119585409/00000003 2300, Z119585409/00000003 2400
          Z119585409/00000004 2100, Z119585409/00000004 2200, Z119585409/00000004 2300, Z119585409/00000004 2400
          Z119585409/00000005 2100, Z119585409/00000005 2200, Z119585409/00000005 2300, Z119585409/00000005 2400
      • Reduce output (page identifier, average block width):
          Z119585409/00000001 2250, Z119585409/00000002 2250, Z119585409/00000003 2250, Z119585409/00000004 2250, Z119585409/00000005 2250
      • 60,000 books (24 million pages): ~ 6 h
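The averaging step can be illustrated with an in-memory map, shuffle, and reduce. This is a minimal sketch, not the HadoopAvBlockWidthMapReduce job itself: it assumes the block widths have already been extracted from each page's HTML file, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def map_page(page_id, block_widths):
    """Map step: emit one (page_id, width) pair per text block."""
    for width in block_widths:
        yield page_id, width

def shuffle(pairs):
    """Group values by key, as the Hadoop framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_page(page_id, widths):
    """Reduce step: average block width of one page, rounded to an integer."""
    return page_id, round(mean(widths))

# Sample data from the slide: four blocks per page.
pages = {
    "Z119585409/00000001": [2100, 2200, 2300, 2400],
    "Z119585409/00000002": [2100, 2200, 2300, 2400],
}
pairs = [pair for pid, widths in pages.items() for pair in map_page(pid, widths)]
averages = dict(reduce_page(pid, widths) for pid, widths in shuffle(pairs).items())
for pid, avg in averages.items():
    print(pid, avg)
```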
  • 23. Analytic Queries: HiveLoadExifData & HiveLoadHocrData
      • CREATE TABLE jp2width (jid STRING, jwidth INT)
          jid                  jwidth
          Z119585409/00000001  2250
          Z119585409/00000002  2150
          Z119585409/00000003  2125
          Z119585409/00000004  2125
          Z119585409/00000005  2250
      • CREATE TABLE htmlwidth (hid STRING, hwidth INT)
          hid                  hwidth
          Z119585409/00000001  1870
          Z119585409/00000002  2100
          Z119585409/00000003  2015
          Z119585409/00000004  1350
          Z119585409/00000005  1700
  • 24. Analytic Queries: HiveSelect
      • Input tables: jp2width (jid, jwidth) and htmlwidth (hid, hwidth)
      • select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid
      • Result:
          jid                  jwidth  hwidth
          Z119585409/00000001  2250    1870
          Z119585409/00000002  2150    2100
          Z119585409/00000003  2125    2015
          Z119585409/00000004  2125    1350
          Z119585409/00000005  2250    1700
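The join can be tried locally without a Hadoop cluster. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Hive, which is an assumption made purely for illustration: HiveQL's STRING and INT are mapped to SQLite's TEXT and INTEGER, and an ORDER BY is added so the result order is deterministic.

```python
import sqlite3

# Sample rows from the slides.
jp2_rows = [
    ("Z119585409/00000001", 2250), ("Z119585409/00000002", 2150),
    ("Z119585409/00000003", 2125), ("Z119585409/00000004", 2125),
    ("Z119585409/00000005", 2250),
]
html_rows = [
    ("Z119585409/00000001", 1870), ("Z119585409/00000002", 2100),
    ("Z119585409/00000003", 2015), ("Z119585409/00000004", 1350),
    ("Z119585409/00000005", 1700),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jp2width (jid TEXT, jwidth INTEGER)")
conn.execute("CREATE TABLE htmlwidth (hid TEXT, hwidth INTEGER)")
conn.executemany("INSERT INTO jp2width VALUES (?, ?)", jp2_rows)
conn.executemany("INSERT INTO htmlwidth VALUES (?, ?)", html_rows)

# Same join as on the slide, plus ORDER BY for a stable row order.
rows = conn.execute(
    "SELECT jid, jwidth, hwidth "
    "FROM jp2width INNER JOIN htmlwidth ON jid = hid "
    "ORDER BY jid"
).fetchall()
conn.close()

for jid, jwidth, hwidth in rows:
    print(jid, jwidth, hwidth)
```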
  • 25. Lessons learnt
      • Emergence of new options for creating large-scale storage and processing infrastructures
          • HDFS as storage master or staging area?
          • Create a local cluster or rent a cloud infrastructure?
      • Apache Hadoop offers a stable core for building a large-scale processing platform that is ready to be used in production
      • Important to carefully select additional components from the Apache Hadoop ecosystem (HBase, Hive, Pig, Oozie, YARN, Ambari, etc.) that fit your needs
  • 26. Further information
      • Project website: www.scape-project.eu
      • Github repository: www.github.com/openplanets
      • Project Wiki: www.wiki.opf-labs.org/display/SP/Home
      • SCAPE tools mentioned
          • SCAPE Platform: http://www.scape-project.eu/publication/an-architectural-overview-of-the-scape-preservation-platform
          • Jpylyzer (JPEG2000 validation): http://www.openplanetsfoundation.org/software/jpylyzer
          • Matchbox (image comparison): https://github.com/openplanets/scape/tree/master/pc-qa-matchbox
      • Thank you! Questions?