Sven Schlarb
Österreichische Nationalbibliothek
LIBER Satellite Event: APARSEN & SCAPE Workshop
21 May 2014, Austrian National Library, Vienna
Application scenarios of the SCAPE project at the Austrian
National Library
• Examples of Big Data in memory institutions
• What are the SCAPE Testbeds?
• Motivation for the Austrian National Library
• Hadoop in a nutshell
• SCAPE Platform setup at the Austrian National Library
• Selected SCAPE tools
• Application scenarios
• Web Archiving
• Austrian Books Online
Overview
This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
• Google Books Project: 30 Million digital books
• http://www.nybooks.com/articles/archives/2013/apr/25/national-digital-public-library-launched
• Europeana: Metadata about over 24 million objects
• Europeana annual report and accounts 2012, Europeana Foundation, April 2013
• HathiTrust: 10 million volumes (over 5.6 million titles)
comprising over 3.7 billion book page images
• http://www.hathitrust.org/statistics_info
• Internet Archive: 364 billion pages, about 10 petabytes.
• http://archive.org and http://archive.org/web/petabox.php
Books, Journals, Newspapers, Websites. Big data?
Takeup
•Stakeholders and Communities
•Dissemination
•Training Activities
•Sustainability
Platform
•Automation
•Workflows
•Parallelization
•Virtualization
SCAPE Project Overview
• Good:
• Storing structured data
• Expressive query language
• ACID, type safety
• But:
• SQL Joins not efficient at scale
• ÖNB 2011: failed to create a complete web archive index
using a single MySQL instance (write performance!)
• Solution?
• Scaling vertically → bigger servers → hardware costs!
• Scaling horizontally → sharding → maintenance costs!
Pushing the boundaries of RDBMSs (e.g. MySQL)
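To illustrate why sharding brings maintenance costs, here is a minimal sketch of hash-based shard routing. It is not the ÖNB setup: the URL keys, the shard counts and the function names are assumptions made for this example. It shows that with simple modulo routing most keys change shard when a shard is added, so data has to be moved.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key (e.g. a URL) to a shard using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Hypothetical keys standing in for archived URLs.
urls = [f"http://example.at/page/{i}" for i in range(10_000)]
fraction = moved_fraction(urls, old_shards=4, new_shards=5)
print(f"{fraction:.0%} of keys move when going from 4 to 5 shards")
```

Schemes such as consistent hashing reduce this movement, but they add further operational complexity.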
• Hadoop offers a cost advantage because
• It usually runs on relatively inexpensive (commodity) hardware
• No binding to specific vendors
• Open-source software
Comparison of storage costs
Source: BITKOM guide "Big-Data-Technologien – Wissen für Entscheider", 2014, p. 39
• Required to move data
• From NAS to Server
• To Cloud
• Multi-terabyte scenarios?
Dealing with large amounts of data
• Immediate processing
• Unified storage and
processing capabilities
• Distributed I/O
• When dealing with large data sets it is usually easier to
bring the processor to the data than the data to the
processor
• Fine-grained parallelisation: all processing cores of
the cluster are used as processors
• Designed for failure: in large clusters, hardware failure
is the norm rather than the exception
• Redundancy: redundant storage of data blocks
(default: 3 copies)
• Data locality: free nodes with direct access to data do
the processing
Some basic Hadoop assumptions
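The redundancy and data-locality assumptions can be sketched in a few lines. This is a simplified illustration, not how HDFS or the Hadoop scheduler is implemented: the round-robin placement, the node names and both function names are assumptions for this example (real HDFS placement is rack-aware).

```python
import itertools

REPLICATION = 3  # default number of copies named on the slide


def place_blocks(block_ids, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    ring = itertools.cycle(range(len(nodes)))
    for block in block_ids:
        start = next(ring)
        placement[block] = [
            nodes[(start + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def pick_node(block, placement, free_nodes):
    """Prefer a free node holding a replica; otherwise use any free node."""
    for node in placement[block]:
        if node in free_nodes:
            return node, True  # data-local: no network transfer needed
    return sorted(free_nodes)[0], False  # block must be read remotely


nodes = ["n1", "n2", "n3", "n4", "n5"]  # hypothetical node names
placement = place_blocks(["b0", "b1", "b2"], nodes)
print(pick_node("b0", placement, {"n3", "n5"}))
print(pick_node("b0", placement, {"n5"}))
```

With three copies on distinct nodes, losing any single node still leaves two replicas of every block.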
What is Hadoop (physically)?
Distributed processing (MapReduce)
Distributed Storage (HDFS)
Hadoop = MapReduce + HDFS
2 x Quad-Core-CPUs:
10 Map (parallelisation)
4 Reduce (aggregation)
4 x 1 TB hard disks with redundancy 3:
1.33 TB effective
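The figures on this slide follow from simple arithmetic, sketched below. The function names are assumptions for this example. The split of 5 Map and 2 Reduce slots per quad-core CPU is taken from the per-CPU configuration described in the deck.

```python
def effective_storage_tb(disks: int, disk_tb: float, replication: int) -> float:
    """Raw capacity divided by the number of copies kept of each block."""
    return disks * disk_tb / replication


def task_slots(cpus: int, map_per_cpu: int = 5, reduce_per_cpu: int = 2):
    """Map and Reduce slots for a node with `cpus` quad-core CPUs."""
    return cpus * map_per_cpu, cpus * reduce_per_cpu


print(round(effective_storage_tb(disks=4, disk_tb=1.0, replication=3), 2))  # 1.33
print(task_slots(cpus=2))  # (10, 4)
```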
Configuration per CPU
Configuration of one Quad-Core-CPU (= 1 node)
4 physical cores
8 hyperthreading cores (the system "sees" 8 cores)
Core allocation: 1 x OS, 5 x Map, 2 x Reduce
Experimental Cluster
[Cluster diagram labels: Job Tracker, Task Trackers, Data Nodes, Name Node]
CPU: 1 x 2.53 GHz quad-core CPU (8 hyperthreading cores)
RAM: 16 GB
DISK: 2 x 1 TB disks configured as RAID 0 (performance), 2 TB effective
• Of 16 HT cores: 5 for Map; 2 for Reduce; 1 for the operating system.
→ 25 processing cores for Map tasks
→ 10 processing cores for Reduce tasks
CPU: 2 x 2.40 GHz quad-core CPU (16 hyperthreading cores)
RAM: 24 GB
DISK: 3 x 1 TB disks configured as RAID 5 (redundancy), 2 TB effective
What is Hadoop (conceptually)?
[Diagram: the input data is divided into three input splits of three records each (records 1 to 9). One Map task (Task 1 to Task 3) processes each split. The Map output passes through sort, shuffle and merge before the Reduce phase writes the aggregated results as output data.]
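The conceptual flow (input splits, Map tasks, sort/shuffle/merge, Reduce) can be imitated in a single process. The sketch below is not Hadoop code; all function names and the parity-counting job are assumptions chosen only to make the stages visible.

```python
from collections import defaultdict


def split_input(records, split_size):
    """Divide the input into fixed-size input splits."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]


def map_task(split, map_fn):
    """Run the map function over every record of one split."""
    pairs = []
    for record in split:
        pairs.extend(map_fn(record))
    return pairs


def shuffle(map_outputs):
    """Group values by key across all map outputs and sort by key."""
    grouped = defaultdict(list)
    for pairs in map_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(sorted(grouped.items()))


def run_job(records, map_fn, reduce_fn, split_size=3):
    splits = split_input(records, split_size)
    map_outputs = [map_task(s, map_fn) for s in splits]  # one task per split
    grouped = shuffle(map_outputs)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def parity(record):
    """Hypothetical map function: emit ('even' | 'odd', 1) per record."""
    number = int(record.split()[1])
    yield ("even" if number % 2 == 0 else "odd", 1)


records = [f"record {i}" for i in range(1, 10)]  # nine records, three splits
result = run_job(records, parity, lambda key, values: sum(values))
print(result)  # {'even': 4, 'odd': 5}
```

In a real cluster the map tasks run in parallel on different nodes, and the shuffle moves data over the network.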
Platform instance architecture at the Austrian National Library
• Access via REST API
• Workflow engine for complex
jobs
• Hive as the frontend for
analytic queries
• MapReduce/Pig for
Extract, Transform, and
Load (ETL)
• "Small" objects in HDFS or
HBase
• "Large" digital objects stored
on NetApp Filer
Taverna Workflow engine
REST API
• ToMaR: scalable command-line processing
• C3PO: large-scale content profiling
• Jpylyzer: JPEG2000 file format validation
• Matchbox: duplicate image detection
Selected SCAPE tools in various application scenarios
• Web Archiving
• Web Archive Mime Type Identification
• Characterisation of web archive data
• Austrian Books Online
• Scenario 2: Image File Format Migration
• Scenario 3: Comparison of Book Derivatives
• Scenario 4: MapReduce in Digitised Book Quality Assurance
Overview of application scenarios
Web archiving
• Storage: approx. 45 TB
• approx. 1.7 billion objects
• Domain harvesting
• Entire top-level-domain .at
every 2 years
• Selective harvesting
• Important websites that
change regularly
• Event harvesting
• Special occasions and
events (e.g. elections)
File format identification in web archives
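[Diagram: (W)ARC containers holding records such as JPG, GIF, HTM, HTM and MID are read through a (W)ARC InputFormat and (W)ARC RecordReader, based on the HERITRIX web crawler. In the Map phase, Apache Tika detects the MIME type of each record (e.g. a JPG record yields image/jpg). The Reduce phase counts the records per type: image/jpg 1, image/gif 1, text/html 2, audio/midi 1.]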
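The MIME type job can be sketched as a map step that emits (type, 1) per record and a reduce step that sums the counts. The sketch below is not the SCAPE implementation and does not call Apache Tika: `detect_mime` is a deliberately tiny magic-byte check standing in for Tika, and the sample payloads are made up for this example. It reports the registered type name `image/jpeg` where the slide writes `image/jpg`.

```python
from collections import Counter

# Minimal magic-byte table; a stand-in for Apache Tika's detector.
SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF8", "image/gif"),
    (b"MThd", "audio/midi"),
]


def detect_mime(payload: bytes) -> str:
    for magic, mime in SIGNATURES:
        if payload.startswith(magic):
            return mime
    head = payload[:64].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "text/html"
    return "application/octet-stream"


def map_record(payload: bytes):
    """Map step: emit (MIME type, 1) for one archive record."""
    return detect_mime(payload), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts per MIME type."""
    totals = Counter()
    for mime, count in pairs:
        totals[mime] += count
    return dict(totals)


# Hypothetical record payloads mirroring the five records in the diagram.
records = [
    b"\xff\xd8\xff\xe0JFIF",
    b"GIF89a",
    b"<html><body>a</body></html>",
    b"<!DOCTYPE html><html></html>",
    b"MThd\x00\x00\x00\x06",
]
counts = reduce_counts(map_record(r) for r in records)
print(counts)
```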
File format identification in web archives
Software integration                                     Throughput (GB/min)
TIKA detector API in Map phase                           6.17
FILE as command-line application with MapReduce          1.70
TIKA JAR as command-line application with MapReduce      0.01

Data volume    Number of ARC files    Throughput (GB/min)
1 GB           10 x 100 MB            1.57
2 GB           20 x 100 MB            2.50
10 GB          100 x 100 MB           3.06
20 GB          200 x 100 MB           3.40
100 GB         1000 x 100 MB          3.71
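To make the throughput figures easier to read, the sketch below converts each measurement into a run time (size divided by throughput). The dictionary repeats the measured values from the table; the function name is an assumption. Extrapolating beyond the measured sizes would be a guess, so the example stays within them.

```python
def minutes_to_process(size_gb: float, throughput_gb_per_min: float) -> float:
    """Run time in minutes for a data set at a given sustained throughput."""
    return size_gb / throughput_gb_per_min


# Data volume in GB -> measured throughput in GB/min (from the table).
measurements = {1: 1.57, 2: 2.5, 10: 3.06, 20: 3.40, 100: 3.71}

for size_gb, throughput in measurements.items():
    minutes = minutes_to_process(size_gb, throughput)
    print(f"{size_gb:>4} GB at {throughput:.2f} GB/min: {minutes:.1f} min")
```

Throughput rises with data volume in these measurements, which is consistent with a fixed job start-up cost being spread over more data.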
Characterisation of web archive data
• Public private partnership with Google
• Only public domain
• Objective to scan ~ 600,000 volumes
• ~ 200 million pages
• ~ 70 project team members
• 20+ in core team
• ~ 200,000 physical volumes scanned so far
• ~ 60 million pages
Austrian Books Online
ADOCO (Austrian Books Online Download & Control)
[Diagram labels: Google, Public Private Partnership, ADOCO, PairTree (https://confluence.ucop.edu/display/Curation/PairTree)]
• TIFF to JPEG2000 migration
• Objective: Reduce storage costs by
reducing the size of the images
• JPEG2000 to TIFF migration
• Objective: mitigate the
JPEG2000 file format obsolescence
risk
• Different preservation tool
categories:
• Validation
• Migration
• Quality assurance
Quality assured image file format migration
Comparison of book derivatives
• Compare different versions of the same book
• Images come from different scanning sources
• Images have been manipulated (cropped, rotated)
• 60,000 books, ~ 24 million pages
• Using Taverna's "Tool service" (remote ssh execution)
• Orchestration of different types of Hadoop jobs
• Hadoop-Streaming-API
• Hadoop Map/Reduce
• Hive
• Workflow available on myExperiment:
http://www.myexperiment.org/workflows/3105
• See blog post:
http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
Using MapReduce for Quality Assurance
[Figure: two page images labelled "Cropping error" and "Correct cropping", annotated with image width and block width]
Assumption: a "significant" difference between average block width and image width
is an indicator of possible text loss due to a cropping error.
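The assumption can be written as a simple check. The slide does not say what counts as "significant", so the relative threshold of 0.5 below is purely hypothetical, as are the function name and the sample widths. The check follows the slide's wording and uses the absolute difference between the two widths.

```python
def is_suspect_page(
    image_width: float, avg_block_width: float, threshold: float = 0.5
) -> bool:
    """Flag a page whose average text block width differs 'significantly'
    from the image width (threshold is a hypothetical relative difference)."""
    if image_width <= 0:
        raise ValueError("image_width must be positive")
    relative_diff = abs(image_width - avg_block_width) / image_width
    return relative_diff > threshold


print(is_suspect_page(image_width=2000, avg_block_width=1800))  # False
print(is_suspect_page(image_width=2000, avg_block_width=600))   # True
```

A flagged page is only a candidate for manual inspection; the check indicates possible text loss, it does not prove it.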
Using MapReduce for Quality Assurance
• Create input text files with
file paths (JP2 & HTML)
• Read image metadata
using Exiftool (Hadoop
Streaming API)
• Create sequence file
containing all HTML files
• Calculate average block
width using MapReduce
• Load data in Hive tables
• Execute SQL test query
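The "calculate average block width" step can be sketched in the style of a Hadoop Streaming mapper and reducer. This is an illustration, not the workflow's actual code: the tab-separated input format (`page_id<TAB>block_width`) and the sample values are assumptions. In a real streaming job both functions would read from standard input and print tab-separated lines.

```python
from itertools import groupby


def mapper(lines):
    """Parse 'page_id<TAB>block_width' lines into (page_id, width) pairs."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        page_id, width = line.split("\t")
        yield page_id, float(width)


def reducer(pairs):
    """Average the widths per page; input must be sorted by key,
    which Hadoop guarantees for reducer input."""
    for page_id, group in groupby(pairs, key=lambda kv: kv[0]):
        widths = [width for _, width in group]
        yield page_id, sum(widths) / len(widths)


# Hypothetical block widths for two pages.
lines = ["page1\t1000", "page2\t400", "page1\t1200", "page2\t600"]
averages = dict(reducer(sorted(mapper(lines))))
print(averages)  # {'page1': 1100.0, 'page2': 500.0}
```

The resulting averages could then be loaded into a Hive table next to the image widths and compared in a query.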
• Possibility for libraries to build cost-efficient solutions
for storing large data collections
• HDFS as storage master or staging area?
• Local cluster vs. cloud?
• Apache Hadoop offers a stable core for building a
large-scale processing platform; ready to be used in
production
• Carefully select additional components from the Apache
Hadoop Ecosystem (HBase, Hive, Pig, Oozie, Yarn,
Ambari, etc.) that fit your needs
Summary
These slides on Slideshare:
• http://de.slideshare.net/SvenSchlarb/application-scenarios-of-the-scape-project-at-the-austrian-national-library
Further information
• Project website: www.scape-project.eu
• Github repository: www.github.com/openplanets
• Project Wiki: www.wiki.opf-labs.org/display/SP/Home
SCAPE tools mentioned
• ToMaR: http://openplanets.github.io/ToMaR/#
• Jpylyzer: http://www.openplanetsfoundation.org/software/jpylyzer
• Matchbox: https://github.com/openplanets/scape/tree/master/pc-qa-matchbox
• C3PO: http://ifs.tuwien.ac.at/imp/c3po
Thank you! Questions?

Benchmarking of distributed linked data streaming systemsBenchmarking of distributed linked data streaming systems
Benchmarking of distributed linked data streaming systems
 

LIBER Satellite Event, SCAPE by Sven Schlarb

  • 1. Application scenarios of the SCAPE project at the Austrian National Library
      Sven Schlarb, Österreichische Nationalbibliothek (Austrian National Library)
      LIBER Satellite Event: APARSEN & SCAPE Workshop, 21 May 2014, Austrian National Library, Vienna
  • 2. Overview
      • Examples of Big Data in memory institutions
      • What are the SCAPE Testbeds?
      • Motivation for the Austrian National Library
      • Hadoop in a nutshell
      • SCAPE Platform setup at the Austrian National Library
      • Selected SCAPE tools
      • Application scenarios
          • Web Archiving
          • Austrian Books Online
      This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
  • 3. Books, Journals, Newspapers, Websites. Big data?
      • Google Books Project: 30 million digital books
          • http://www.nybooks.com/articles/archives/2013/apr/25/national-digital-public-library-launched
      • Europeana: Metadata about over 24 million objects
          • Europeana annual report and accounts 2012, Europeana Foundation, April 2013
      • Hathi Trust: 10 million volumes (over 5.6 million titles) comprising over 3.7 billion book page images
          • http://www.hathitrust.org/statistics_info
      • Internet Archive: 364 billion pages, about 10 petabytes
          • http://archive.org and http://archive.org/web/petabox.php
  • 4. SCAPE Project Overview
      • Takeup
          • Stakeholders and Communities
          • Dissemination
          • Training Activities
          • Sustainability
      • Platform
          • Automation
          • Workflows
          • Parallelization
          • Virtualization
  • 5. Pushing the boundaries of RDBMSs (e.g. MySQL)
      • Good:
          • Storing structured data
          • Expressive query language
          • ACID, type safety
      • But:
          • SQL joins are not efficient at scale
          • ÖNB 2011: Failed to create a complete web-archive index using a single-instance MySQL (write performance!)
      • Solution?
          • Scaling vertically → bigger servers → hardware costs!
          • Scaling horizontally → sharding → maintenance costs!
  • 6. Comparison of storage costs
      • Hadoop offers a cost advantage because
          • It usually runs on relatively inexpensive (commodity) hardware
          • No binding to specific vendors
          • Open-source software
      Source: BITKOM Leitfaden Big-Data-Technologien-Wissen für Entscheider 2014, p. 39
  • 7. Dealing with large amounts of data
      • Required to move data
          • From NAS to server
          • To cloud
          • Multi-terabyte scenarios?
      • Immediate processing
      • Unified storage and processing capabilities
      • Distributed I/O
  • 8. Some basic Hadoop assumptions
      • When dealing with large data sets it is usually easier to bring the processor to the data than the data to the processor
      • Fine-grained parallelisation: All processing cores of the cluster are used as processors
      • Designed for failure: In large clusters hardware failure is the norm rather than the exception
      • Redundancy: Redundant storage of data blocks (default: 3 copies)
      • Data locality: Free nodes with direct access to the data do the processing
  • 9. What is Hadoop (physically)?
      • Hadoop = MapReduce + HDFS
      • Distributed processing (MapReduce): 2 x quad-core CPUs: 10 Map (parallelisation), 4 Reduce (aggregation)
      • Distributed storage (HDFS): 4 x 1 TB hard disks with redundancy 3: 1.33 TB effective
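The figures on slide 9 follow from simple arithmetic, sketched below. The function names are illustrative, and the split of 5 Map and 2 Reduce slots per CPU is taken from the per-CPU configuration on slide 10; a real cluster sets these values in its Hadoop configuration.

```python
def effective_capacity_tb(disks: int, disk_size_tb: float, replication: int) -> float:
    """Usable HDFS capacity when every data block is stored `replication` times."""
    return disks * disk_size_tb / replication


def task_slots(cpus: int, map_per_cpu: int = 5, reduce_per_cpu: int = 2) -> tuple[int, int]:
    """Total (map, reduce) slots; the per-CPU defaults are the slide-10 split."""
    return cpus * map_per_cpu, cpus * reduce_per_cpu


capacity = effective_capacity_tb(disks=4, disk_size_tb=1.0, replication=3)
print(f"{capacity:.2f} TB effective")  # 4 TB raw divided by 3 copies
print(task_slots(cpus=2))              # (map slots, reduce slots)
```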
  • 10. Configuration per CPU
      • Configuration of one quad-core CPU (= 1 node)
      • 4 physical cores, 8 hyperthreading cores (the system "sees" 8 cores)
      • Core allocation: OS, Map, Map, Map, Map, Map, Reduce, Reduce
  • 11. Experimental Cluster
      • Diagram labels: Job Tracker, Task Trackers, Data Nodes, Name Node
      • CPU: 1 x 2.53 GHz quad-core CPU (8 HyperThreading cores); RAM: 16 GB; DISK: 2 x 1 TB disks configured as RAID0 (performance), 2 TB effective
      • Of 16 HT cores: 5 for Map; 2 for Reduce; 1 for the operating system
          → 25 processing cores for Map tasks
          → 10 processing cores for Reduce tasks
      • CPU: 2 x 2.40 GHz quad-core CPU (16 HyperThreading cores); RAM: 24 GB; DISK: 3 x 1 TB disks configured as RAID5 (redundancy), 2 TB effective
  • 12. What is Hadoop (conceptually)?
      • Input data: Input split 1 (Records 1-3), Input split 2 (Records 4-6), Input split 3 (Records 7-9)
      • Map: Task 1, Task 2, Task 3 (one per input split)
      • Sort, Shuffle, Merge
      • Reduce: Aggregated Result, Aggregated Result
      • Output data
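The flow on slide 12 can be imitated in a few lines of plain Python. This is only a single-process sketch of the idea (three input splits, one map task per split, a sort/shuffle step that groups values by key, and a reduce step that aggregates them); it does not use Hadoop, and the record values and function names are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Map task: emit one (key, 1) pair for every record in its input split."""
    return [(record, 1) for record in split]


def shuffle(pairs):
    """Sort/shuffle/merge: collect all values per key, keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Reduce task: aggregate the list of values of each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


# Three input splits with three records each, as in the diagram.
splits = [["a", "b", "a"], ["b", "c", "a"], ["c", "c", "a"]]
mapped = chain.from_iterable(map_phase(split) for split in splits)
result = reduce_phase(shuffle(mapped))
print(result)
```

On a cluster the map tasks run in parallel on the nodes that hold the splits, and the shuffle moves data between nodes; here everything happens in one process.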
  • 13. Platform instance architecture at the Austrian National Library
      • Access via REST API
      • Workflow engine for complex jobs (Taverna)
      • Hive as the frontend for analytic queries
      • MapReduce/Pig for Extract, Transform, and Load (ETL)
      • "Small" objects in HDFS or HBase
      • "Large" digital objects stored on NetApp Filer
  • 14. Selected SCAPE tools in various application scenarios
      • ToMaR: Scalable command line processing
      • C3PO: Large-scale content profiling
      • Jpylyzer: JPEG2000 file format validation
      • Matchbox: Duplicate image detection
  • 15. Overview of application scenarios
      • Web Archiving
          • Web Archive Mime Type Identification
          • Characterisation of web archive data
      • Austrian Books Online
          • Scenario 2: Image File Format Migration
          • Scenario 3: Comparison of Book Derivatives
          • Scenario 4: MapReduce in Digitised Book Quality Assurance
  • 16. Web archiving
      • Storage: approx. 45 TB
      • Approx. 1.7 billion objects
      • Domain harvesting
          • Entire top-level domain .at every 2 years
      • Selective harvesting
          • Important websites that change regularly
      • Event harvesting
          • Special occasions and events (e.g. elections)
  • 17. File format identification in web archives
  • 18. File format identification in web archives
      • (W)ARC container: JPG, GIF, HTM, HTM, MID
      • (W)ARC InputFormat, (W)ARC RecordReader (based on the HERITRIX web crawler)
      • Map: Apache Tika detects the MIME type of each record (e.g. JPG → image/jpg)
      • Reduce: image/jpg 1, image/gif 1, text/html 2, audio/midi 1

      Software integration                                   Throughput (GB/min)
      TIKA detector API in Map phase                         6.17
      FILE as command-line application with MapReduce        1.70
      TIKA JAR as command-line application with MapReduce    0.01

      Data volume    Number of ARC files    Throughput (GB/min)
      1 GB           10 x 100 MB            1.57
      2 GB           20 x 100 MB            2.5
      10 GB          100 x 100 MB           3.06
      20 GB          200 x 100 MB           3.40
      100 GB         1000 x 100 MB          3.71
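The detect-then-count pattern of slide 18 can be sketched without Hadoop or Tika. In the sketch below, a tiny hand-written signature table stands in for the Apache Tika detector, and an in-memory list of byte strings stands in for the (W)ARC records; both are assumptions made for illustration, and the function names are hypothetical. Note that the sketch reports the registered type `image/jpeg` where the slide writes `image/jpg`.

```python
from collections import Counter

# Hypothetical, deliberately tiny signature table; Apache Tika knows far more formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"MThd", "audio/midi"),
]


def detect_mime(payload: bytes) -> str:
    """Guess a MIME type from the leading bytes of one archived record."""
    for magic, mime in SIGNATURES:
        if payload.startswith(magic):
            return mime
    head = payload[:512].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "text/html"
    return "application/octet-stream"


def count_mime_types(records):
    """Map step: detect one type per record. Reduce step: sum occurrences per type."""
    return Counter(detect_mime(payload) for payload in records)


# Stand-ins for the five records in the slide's container (JPG, GIF, HTM, HTM, MID).
records = [
    b"\xff\xd8\xff\xe0" + b"\x00" * 8,
    b"GIF89a" + b"\x00" * 8,
    b"<html><body>one</body></html>",
    b"<!DOCTYPE html><html></html>",
    b"MThd\x00\x00\x00\x06",
]
counts = count_mime_types(records)
for mime, n in sorted(counts.items()):
    print(mime, n)
```

Calling the detector inside the map function, rather than starting an external process per record, mirrors the fastest variant in the throughput table above.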
  • 19. Characterisation of web archive data
  • 20. Characterisation of web archive data
  • 21. Characterisation of web archive data
  • 22. Austrian Books Online
      • Public private partnership with Google
      • Only public domain
      • Objective to scan ~ 600,000 volumes
          • ~ 200 million pages
      • ~ 70 project team members
          • 20+ in core team
      • ~ 200K physical volumes scanned so far
          • ~ 60 million pages
  • 23. ADOCO (Austrian Books Online Download & Control)
      • Diagram labels: Google, Public Private Partnership, ADOCO
      • PairTree: https://confluence.ucop.edu/display/Curation/PairTree
  • 24. Quality assured image file format migration
      • TIFF to JPEG2000 migration
          • Objective: Reduce storage costs by reducing the size of the images
      • JPEG2000 to TIFF migration
          • Objective: Mitigation of the JPEG2000 file format obsolescence risk
      • Different preservation tool categories:
          • Validation
          • Migration
          • Quality assurance
  • 25. Comparison of book derivatives
      • Compare different versions of the same book
      • Images come from different scanning sources
      • Images have been manipulated (cropped, rotated)
  • 26. Using MapReduce for Quality Assurance
      • 60,000 books, ~ 24 million pages
      • Using Taverna's "Tool service" (remote ssh execution)
      • Orchestration of different types of Hadoop jobs
          • Hadoop-Streaming-API
          • Hadoop Map/Reduce
          • Hive
      • Workflow available on myExperiment: http://www.myexperiment.org/workflows/3105
      • See blog post: http://www.openplanetsfoundation.org/blogs/2012-08-07-big-data-processing-chaining-hadoop-jobs-using-taverna
  • 27. Using MapReduce for Quality Assurance
      • Figure labels: Image width, Block width; Cropping error, Correct cropping
      • Assumption: A "significant" difference between average block width and image width is an indicator of possible text loss due to a cropping error.
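The assumption on slide 27 can be written down as a small check. The slide does not say what counts as "significant", so the relative-difference measure and the threshold of 0.4 below are assumptions for illustration only; the page data and function names are made up as well.

```python
from statistics import mean


def relative_width_difference(image_width: int, block_widths: list[int]) -> float:
    """Difference between image width and average block width, relative to the image width."""
    if image_width <= 0 or not block_widths:
        raise ValueError("need a positive image width and at least one block width")
    return abs(image_width - mean(block_widths)) / image_width


def is_cropping_suspect(image_width: int, block_widths: list[int], threshold: float = 0.4) -> bool:
    """Flag a page whose widths differ by more than the (assumed) threshold."""
    return relative_width_difference(image_width, block_widths) > threshold


# Hypothetical pages: (image width in pixels, widths of the OCR text blocks).
pages = {
    "page_0001": (2000, [1580, 1620, 1600]),
    "page_0002": (1000, [1580, 1620, 1600]),
}
suspects = [page for page, (width, blocks) in pages.items() if is_cropping_suspect(width, blocks)]
print(suspects)
```

A flagged page is only a candidate for manual inspection; pages with very little text could also produce a large difference.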
  • 28. Using MapReduce for Quality Assurance
      • Create input text files with file paths (JP2 & HTML)
      • Read image metadata using Exiftool (Hadoop Streaming API)
      • Create sequence file containing all HTML files
      • Calculate average block width using MapReduce
      • Load data in Hive tables
      • Execute SQL test query
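The last two steps on slide 28 (load the results into tables, then run a SQL test query) can be tried out locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Hive; the table names, column names, sample rows, and the threshold are all hypothetical, and the actual Hive schema and query of the workflow are not given in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE image_width (page_id TEXT PRIMARY KEY, width INTEGER);
    CREATE TABLE block_width (page_id TEXT PRIMARY KEY, avg_width REAL);
    """
)
# Image widths as they might come from the Exiftool step (made-up values).
conn.executemany(
    "INSERT INTO image_width VALUES (?, ?)",
    [("page_0001", 2000), ("page_0002", 1000)],
)
# Average block widths as they might come from the MapReduce step (made-up values).
conn.executemany(
    "INSERT INTO block_width VALUES (?, ?)",
    [("page_0001", 1600.0), ("page_0002", 1600.0)],
)

THRESHOLD = 0.4  # assumed meaning of a "significant" relative difference

rows = conn.execute(
    """
    SELECT i.page_id, i.width, b.avg_width
    FROM image_width AS i
    JOIN block_width AS b ON b.page_id = i.page_id
    WHERE ABS(i.width - b.avg_width) > ? * i.width
    ORDER BY i.page_id
    """,
    (THRESHOLD,),
).fetchall()
print(rows)
```

The join on the page identifier is the key step: it brings together the two values that were computed by separate jobs.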
  • 29. Summary
      • Possibility for libraries to build cost-efficient solutions for storing large data collections
          • HDFS as storage master or staging area?
          • Local cluster vs. cloud?
      • Apache Hadoop offers a stable core for building a large-scale processing platform; ready to be used in production
      • Carefully select additional components from the Apache Hadoop ecosystem (HBase, Hive, Pig, Oozie, Yarn, Ambari, etc.) that fit your needs
  • 30. Thank you! Questions?
      • These slides on Slideshare: http://de.slideshare.net/SvenSchlarb/application-scenarios-of-the-scape-project-at-the-austrian-national-library
      • Further information
          • Project website: www.scape-project.eu
          • Github repository: www.github.com/openplanets
          • Project Wiki: www.wiki.opf-labs.org/display/SP/Home
      • SCAPE tools mentioned
          • ToMaR: http://openplanets.github.io/ToMaR/#
          • Jpylyzer: http://www.openplanetsfoundation.org/software/jpylyzer
          • Matchbox: https://github.com/openplanets/scape/tree/master/pc-qa-matchbox
          • C3PO: http://ifs.tuwien.ac.at/imp/c3po