Integrating the Fedora-based DOMS repository with Hadoop
Asger Askov Blekinge
State and University Library, Denmark
SCAPE Information Day
State and University Library, Denmark, June 25th, 2014
Our Repositories
• Each File is stored in Bit Magasinet, our bit preservation storage system.
• Each Record is stored in DOMS and has a reference to the File in Bit Magasinet.
• Can Hadoop be added to this setup?
This work was partially supported by the SCAPE Project.
The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
Hadoop Data Locality
• Rule 1: The size of the Hadoop cluster should be independent of the size of the data storage.
• Data should be read from local disks. This prevents a central storage system from limiting the speed of the cluster.
• With this restriction, the number of nodes in the cluster can keep growing.
• Without it, the cluster will reach a point where it overloads the central storage system.
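The scaling argument on this slide can be made concrete with a little arithmetic. The sketch below is an illustration only: the function name and all throughput figures are hypothetical, and it assumes every node reads at the same rate.

```python
def cluster_throughput(nodes, per_node_mb_s, central_limit_mb_s=None):
    """Aggregate read throughput in MB/s for a cluster of `nodes` workers.

    With local disks each node reads at its own disk speed, so the total
    grows linearly. With a central storage system the total is capped at
    whatever that system can deliver.
    """
    demand = nodes * per_node_mb_s
    if central_limit_mb_s is None:
        return demand
    return min(demand, central_limit_mb_s)


if __name__ == "__main__":
    # Illustrative numbers only: 100 MB/s per node, 1000 MB/s central storage
    for nodes in (5, 10, 20, 40):
        local = cluster_throughput(nodes, 100)
        central = cluster_throughput(nodes, 100, central_limit_mb_s=1000)
        print(f"{nodes:>3} nodes: local={local} MB/s, central={central} MB/s")
```

Under these assumed figures, adding nodes beyond ten no longer helps when all reads go through the central system, while the local-disk case keeps growing.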
Repositories (DOMS) and Hadoop
• Repositories, especially Fedora 3.x, are single-headed: you cannot add more machines to the repository to increase its performance.
• If Hadoop accesses the repository directly, it will be limited to the speed of the repository.
Bit archive systems and Hadoop
• Hadoop provides its own bit archive system in the form of HDFS, which is integrated with the cluster.
• We do not use this. We have built our own system instead, Bit Magasinet.
• We can handle many more files because we use magnetic tapes rather than disks.
• But: it requires us to request a number of files, which will then be made available to Hadoop.
State and University Library
• Hadoop does not play nice with DOMS or Bit Magasinet
• This state of affairs is not acceptable to us.
• Besides, it is a nice challenge ;)
How we do it in the Newspaper digitisation project
• Files are stored in Bit Magasinet
• One Batch Object
  – The Batch Object has a list of files
• One Record per File
• A Hadoop map/reduce job is split into two steps
  – Map, where the work on each “record” is performed
  – Reduce, where the results are collated
• In the Map step, we run the tool on the file.
  – We have a lot of Map workers.
• In the Reduce step, we store the results in the repository.
  – We have only a few Reduce workers.
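The split between the two steps can be sketched in plain Python, without a cluster. This is a toy model rather than the project's code: `run_tool` is a hypothetical stand-in for the real tool, and a dictionary stands in for the repository.

```python
from collections import defaultdict


def run_tool(path):
    """Hypothetical stand-in for the real tool: reports the file extension."""
    return path.rsplit(".", 1)[-1].lower()


def map_step(paths):
    """Map: work on each record independently; this is the part that scales."""
    for path in paths:
        yield path, run_tool(path)


def reduce_step(pairs, repository):
    """Reduce: collate the results and store them in the repository."""
    for path, result in pairs:
        repository[path].append(result)
    return repository


if __name__ == "__main__":
    files = ["page-0001.jp2", "page-0002.jp2"]
    repo = reduce_step(map_step(files), defaultdict(list))
    print(dict(repo))
```

Only `reduce_step` touches the repository, which is why a few Reduce workers are enough while the Map side can grow.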
How we do it in the Newspaper digitisation project
• Retrieve the list of files from DOMS
• Request these files from Bit Magasinet
• Start the Hadoop job on the files
  – Map: Run Jpylyzer on each file (many worker nodes)
  – Reduce: Store the results back in DOMS (few worker nodes)
• This way, the actual work on the records is not connected to DOMS, and we can scale the cluster independently from the repository
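A Map step of this kind could be written as a line-oriented mapper, for example for Hadoop Streaming. The slides do not say how the job is actually implemented, so the following is a sketch under assumptions: the input is one file path per line, a `jpylyzer` executable is on the PATH of every worker node, and the names `run_jpylyzer` and `mapper` are made up for illustration.

```python
import subprocess


def run_jpylyzer(path):
    """Run the jpylyzer command-line tool on one file and return its XML report.

    Assumes a `jpylyzer` executable is installed on the worker node.
    """
    completed = subprocess.run(
        ["jpylyzer", path], capture_output=True, text=True, check=True
    )
    return completed.stdout


def mapper(lines, analyse=run_jpylyzer):
    """Yield one tab-separated 'path<TAB>report' line per input path."""
    for line in lines:
        path = line.strip()
        if not path:
            continue
        # Collapse whitespace so the whole report fits on one output line
        report = " ".join(analyse(path).split())
        yield f"{path}\t{report}"
```

In a streaming job the script would iterate over `mapper(sys.stdin)` and print each line. The `analyse` parameter exists so the mapper can be exercised with a fake tool on a machine where jpylyzer is not installed.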
How we do it in SCAPE
• Staging: Retrieve the records from DOMS to an archive file
• Hadooping: Hadoop reads the records, does the work and writes new, updated records to the archive file
• Loading: Store the updated records in DOMS
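The three phases can be sketched end to end with simple stand-ins. Everything here is an assumption made for illustration: a dictionary plays the role of DOMS, a JSON-lines file plays the role of the archive file, and the function names are hypothetical.

```python
import json
import os
import tempfile


def stage(repository, identifiers, archive_path):
    """Staging: copy the selected records out of the repository into one file."""
    with open(archive_path, "w", encoding="utf-8") as archive:
        for identifier in identifiers:
            record = {"id": identifier, **repository[identifier]}
            archive.write(json.dumps(record) + "\n")


def process(archive_path, work):
    """Hadooping: read every record, do the work, write the updated records."""
    with open(archive_path, encoding="utf-8") as archive:
        records = [json.loads(line) for line in archive]
    with open(archive_path, "w", encoding="utf-8") as archive:
        for record in records:
            archive.write(json.dumps(work(record)) + "\n")


def load(repository, archive_path):
    """Loading: store the updated records back in the repository."""
    with open(archive_path, encoding="utf-8") as archive:
        for line in archive:
            record = json.loads(line)
            identifier = record.pop("id")
            repository[identifier] = record


def demo():
    """Round trip with an in-memory dict standing in for the repository."""
    repository = {"uuid:1": {"title": "page 1"}}
    with tempfile.TemporaryDirectory() as folder:
        archive_path = os.path.join(folder, "records.jsonl")
        stage(repository, ["uuid:1"], archive_path)
        process(archive_path, lambda record: {**record, "checked": True})
        load(repository, archive_path)
    return repository
```

The repository is only touched in `stage` and `load`; the middle phase works purely on the archive file, which is the property that lets the cluster scale on its own.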
Step 1 – Retrieve records
• SCAPE has devised a repository-agnostic object format based on METS
  – github.com/openplanets/scape-platform-datamodel
• SCAPE has designed a generic repository REST interface
  – github.com/openplanets/scape-apis
• SB has implemented the SCAPE Repo API for DOMS
  – github.com/statsbiblioteket/scape-doms-data-connector
• We have implemented a client for the SCAPE Repo API
  – github.com/statsbiblioteket/scape-stager-loader
SCAPE Datamodel mapping
SCAPE Repository API
<mets:mets ID="scape-entity:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af" OBJID="scape-entity:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af" PROFILE="scape">
  <mets:metsHdr RECORDSTATUS="NEW"/>
  <mets:dmdSec ID="DMD-8c72c14d-475a-49a2-9f43-321732c4e7a2">
    <mets:mdWrap MDTYPE="OTHER">
      <mets:xmlData/>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:dmdSec ID="DMD-747421f1-fc0d-4c1d-896c-9087d43b5e10">
    <mets:mdWrap MDTYPE="OTHER">
      <mets:xmlData>
        <scape:versionMD version-number="1"/>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:amdSec>
    <mets:techMD ID="TMD-scape-representation:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af-SCAPE_REPRESENTATION_TECHNICAL">
      <mets:mdWrap MDTYPE="OTHER">
        <mets:xmlData/>
      </mets:mdWrap>
    </mets:techMD>
    <mets:techMD ID="TMD-scape-representation:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af-scape-file:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af-JPYLYZER">
      <mets:mdWrap MDTYPE="OTHER">
        <mets:xmlData>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:techMD>
  </mets:amdSec>
  <mets:fileSec>
    <mets:fileGrp>
      <mets:file ID="scape-file:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af" SEQ="0" ADMID="TMD-scape-representation:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af-scape-file:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af-JPYLYZER" MIMETYPE="image/jp2">
        <mets:FLocat xlink:href="http://bitfinder.statsbiblioteket.dk/newspapers/B400022028241-RT1_400022028241-1_1795-06-01_adresseavisen1759-1795-06-01-0006.jp2"
                     xlink:title="B400022028241-RT1_400022028241-1_1795-06-01_adresseavisen1759-1795-06-01-0006.jp2" LOCTYPE="URL"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap>
    <mets:div TYPE="Intellectual entity">
      <mets:div ID="scape-representation:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af" ADMID="TMD-scape-representation:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af-SCAPE_REPRESENTATION_TECHNICAL" TYPE="Representation" xlink:label="page-image-adresseavisen1759-1795-06-01-0007A">
        <mets:fptr FILEID="scape-file:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
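A worker that receives such a record mainly needs the file location from the `fileSec`. The sketch below shows one way to pull it out with Python's standard library. The sample document is a trimmed-down, hypothetical record, not the one from the slide: it declares the METS and XLink namespaces explicitly because the slide excerpt omits its namespace declarations, and the ID and URL values are placeholders.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Trimmed-down record with hypothetical values, for illustration only
SAMPLE = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink" PROFILE="scape">
  <mets:fileSec>
    <mets:fileGrp>
      <mets:file ID="scape-file:uuid:example" MIMETYPE="image/jp2">
        <mets:FLocat xlink:href="http://example.org/page-0001.jp2" LOCTYPE="URL"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def file_locations(mets_xml):
    """Return (file ID, MIME type, URL) for every file in a METS document."""
    root = ET.fromstring(mets_xml)
    href = "{%s}href" % NAMESPACES["xlink"]
    locations = []
    for file_element in root.iterfind(".//mets:file", NAMESPACES):
        locator = file_element.find("mets:FLocat", NAMESPACES)
        if locator is None:
            continue  # a file entry without a location cannot be fetched
        locations.append(
            (file_element.get("ID"), file_element.get("MIMETYPE"), locator.get(href))
        )
    return locations


if __name__ == "__main__":
    print(file_locations(SAMPLE))
```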
SCAPE Repository API
• Get Entity – GET /entity/<entityID>
• Update Entity – PUT /entity/<entityID>
• Create Entity – POST /entity/<entityID>
• And many more
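A client for the Get and Update operations listed here could be sketched as below. Only the HTTP methods and the `/entity/<entityID>` path come from the slide. The base URL is hypothetical, sending the body as `application/xml` is an assumption, and percent-encoding the identifier is a design choice of this sketch. The requests are built but never sent.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:8080/scape"  # hypothetical service location


def entity_request(method, entity_id, mets_xml=None, base_url=BASE_URL):
    """Build (but do not send) a request for one of the entity operations."""
    data = mets_xml.encode("utf-8") if mets_xml is not None else None
    url = f"{base_url}/entity/{quote(entity_id, safe='')}"
    request = urllib.request.Request(url, data=data, method=method)
    if data is not None:
        # Assumption: the service accepts the METS record as XML
        request.add_header("Content-Type", "application/xml")
    return request


def get_entity(entity_id):
    """Get Entity: GET /entity/<entityID>."""
    return entity_request("GET", entity_id)


def update_entity(entity_id, mets_xml):
    """Update Entity: PUT /entity/<entityID> with the new METS record."""
    return entity_request("PUT", entity_id, mets_xml)
```

Passing the returned object to `urllib.request.urlopen` would perform the call against a running service.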
SCAPE Stager/Loader
• Checkout
  java -jar scape-stager-loader.jar \
    --id_file=identifierFile.txt \
    --checkoutSequenceFile="test.seqfile" \
    checkout
• Commit
  java -jar scape-stager-loader.jar \
    --commitSequenceFile="test.seqfile" \
    commit
Step 2: Hadoop reads and updates records
• The Hadoop job is started with the sequence file as input
• For each record in the sequence file
  – Read the record
  – Do work
  – Update the record in the sequence file with the result of the work
Step 3: Store the updated records in DOMS
• The Hadoop job produces a sequence file
• For each record in the sequence file:
  – Read the record into memory
  – Any changed fields are updated in the corresponding DOMS objects
• This way, the actual work on the records is not connected to DOMS, and we can scale the cluster independently from the repository
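Updating only the changed fields can be sketched as a comparison between the stored record and the updated one. This is an illustration under assumptions: records are modelled as flat dictionaries, a dictionary stands in for DOMS, and the function names are hypothetical.

```python
def changed_fields(original, updated):
    """Return the fields that are new or have a different value in `updated`."""
    return {
        name: value
        for name, value in updated.items()
        if name not in original or original[name] != value
    }


def load_records(repository, updated_records):
    """Apply only the changed fields of each record to the repository.

    `repository` maps an identifier to a dict of fields. Returns the number
    of fields that were written.
    """
    written = 0
    for identifier, updated in updated_records:
        stored = repository.setdefault(identifier, {})
        changes = changed_fields(stored, updated)
        stored.update(changes)
        written += len(changes)
    return written
```

Writing only the differences keeps the load on the repository proportional to what the job actually changed, not to the number of records it read.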

The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 

Último (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information Day, 25 June 2014

  • 1. Integrating the Fedora based DOMS repository with Hadoop
      Asger Askov Blekinge, State and University Library, Denmark
      SCAPE Information Day, State and University Library, Denmark, June 25th 2014
  • 2. Our Repositories
      • Each File is stored in Bit Magasinet, our bit preservation storage system.
      • Each Record is stored in DOMS and has a reference to the File in Bit Magasinet.
      • Can Hadoop be added to this setup?
      This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
  • 3. Hadoop Data Locality
      • Rule 1: The size of the Hadoop cluster should be independent of the size of the data storage.
      • Data should be read from local disks. This prevents a central storage system from limiting the speed of the cluster.
      • With this restriction, the number of nodes in the cluster can keep growing.
      • Without it, the cluster will reach a point where it overloads the central storage system.
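The scaling argument on this slide can be made concrete with a little arithmetic. The sketch below assumes a central store with a fixed total read bandwidth shared by all nodes, and compares it with nodes that read from their own disks. The function name and every figure are hypothetical and only illustrate the shape of the argument.

```python
def cluster_read_rate(nodes, per_node_mb_s, central_limit_mb_s=None):
    """Aggregate read rate in MB/s for a cluster of `nodes` workers.

    per_node_mb_s: what one node can read per second.
    central_limit_mb_s: total bandwidth of a shared central store, or None
    when every node reads from its own local disk.
    All figures are illustrative assumptions, not measurements.
    """
    demand = nodes * per_node_mb_s
    if central_limit_mb_s is None:
        return demand  # local reads: the rate grows with the cluster
    return min(demand, central_limit_mb_s)  # shared store caps the total


# Hypothetical figures: 100 MB/s per node, 1000 MB/s for the central store.
for n in (5, 10, 20, 40):
    local = cluster_read_rate(n, 100)
    central = cluster_read_rate(n, 100, central_limit_mb_s=1000)
    print(n, local, central)
```

With these made-up figures the central store is saturated at 10 nodes, so growing the cluster beyond that adds no read throughput, while the local-disk rate keeps growing with the node count.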
  • 4. Repositories (DOMS) and Hadoop
      • Repositories, especially Fedora 3.x, are single-headed. You cannot add more machines to the repository to increase its performance.
      • If Hadoop accesses the repository directly, it will be limited to the speed of the repository.
  • 5. Bit archive systems and Hadoop
      • Hadoop provides its own bit archive system in the form of HDFS, which is integrated with the cluster.
      • We do not use this. We have built our own system instead, Bit Magasinet.
      • We can handle many more files because we use magnetic tapes rather than disks.
      • But: it requires us to request a number of files, which are then made available to Hadoop.
  • 6. State and University Library
      • Hadoop does not play nice with DOMS or Bit Magasinet.
      • This state of affairs is not acceptable to us.
      • Besides, it is a nice challenge ;)
  • 7. How we do it in the Newspaper digitisation project
      • Files are stored in Bit Magasinet.
      • One Batch Object
          – The Batch object has a list of files.
      • One Record per File.
  • 8. How we do it in the Newspaper digitisation project
      • A Hadoop map/reduce job is split into two steps:
          – Map, where the work on each "record" is performed.
          – Reduce, where the results are collated.
      • In the Map step, we run the tool on the file.
          – We have a lot of Map workers.
      • In the Reduce step, we store the results in the repository.
          – We have only a few Reduce workers.
  • 9. How we do it in the Newspaper digitisation project
      • Retrieve the list of files from DOMS.
      • Request these files from Bit Magasinet.
      • Start the Hadoop job on the files:
          – Map: Run Jpylyzer on each file (many worker nodes).
          – Reduce: Store the results back in DOMS (few worker nodes).
      • This way, the actual work on the records is not connected to DOMS, and we can scale the cluster.
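The division of labour between the Map and Reduce steps can be sketched in a few lines. This is not Hadoop code: a thread pool stands in for the many Map workers, a plain dict stands in for DOMS, and `characterise` is a hypothetical placeholder for running Jpylyzer. The point it illustrates is that only the Reduce step touches the repository.

```python
from concurrent.futures import ThreadPoolExecutor


def characterise(file_name):
    """Stand-in for running Jpylyzer on one file (hypothetical result)."""
    return file_name, {"valid": file_name.endswith(".jp2")}


def run_job(file_names, repository, map_workers=8):
    """Map on many workers, then collate and store in one reduce step."""
    # Map: each file is processed independently; the repository is not involved.
    with ThreadPoolExecutor(max_workers=map_workers) as pool:
        results = list(pool.map(characterise, file_names))
    # Reduce: only this step writes to the repository stand-in.
    for file_name, result in results:
        repository[file_name] = result
    return repository


doms = {}  # a plain dict standing in for DOMS
run_job(["page-0001.jp2", "page-0002.jp2", "notes.txt"], doms)
print(doms)
```

Raising `map_workers` changes nothing about how often the repository is written to, which is the property the slide relies on when it says the cluster can be scaled.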
  • 10. How we do it in SCAPE
      • Staging: Retrieve the records from DOMS to an archive file.
      • Hadooping: Hadoop reads the records, does the work, and writes new, updated records to the archive file.
      • Loading: Store the updated records in DOMS.
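The three phases can be sketched end to end without any Hadoop or DOMS dependency. In this minimal sketch a dict stands in for DOMS, JSON-lines files stand in for the archive file, and `process` is a hypothetical single-machine replacement for the Hadoop job. Writing the updates to a second file rather than back into the staged one is a simplification made here, not something the slides specify.

```python
import json
import os
import tempfile


def stage(repository, record_ids, path):
    """Staging: copy the selected records from the repository to a file."""
    with open(path, "w", encoding="utf-8") as out:
        for record_id in record_ids:
            entry = {"id": record_id, "record": repository[record_id]}
            out.write(json.dumps(entry) + "\n")


def process(in_path, out_path, work):
    """Hadooping stand-in: read each record, apply `work`, write the update."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            entry = json.loads(line)
            entry["record"] = work(entry["record"])
            dst.write(json.dumps(entry) + "\n")


def load(repository, path):
    """Loading: store the updated records back in the repository."""
    with open(path, encoding="utf-8") as src:
        for line in src:
            entry = json.loads(line)
            repository[entry["id"]] = entry["record"]


doms = {"uuid:1": {"mimetype": "image/jp2"}, "uuid:2": {"mimetype": "image/jp2"}}
workdir = tempfile.mkdtemp()
staged = os.path.join(workdir, "staged.jsonl")
updated = os.path.join(workdir, "updated.jsonl")

stage(doms, ["uuid:1", "uuid:2"], staged)
process(staged, updated, lambda record: {**record, "checked": True})
load(doms, updated)
print(doms["uuid:1"])
```

The repository is read once during staging and written once during loading; everything in between works on files only.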
  • 11. Step 1 – Retrieve records
      • SCAPE has devised a repository-agnostic object format based on METS
          – github.com/openplanets/scape-platform-datamodel
      • SCAPE has designed a generic repository REST interface
          – github.com/openplanets/scape-apis
      • SB has implemented the SCAPE Repo API for DOMS
          – github.com/statsbiblioteket/scape-doms-data-connector
      • We have implemented a client for the SCAPE Repo API
          – github.com/statsbiblioteket/scape-stager-loader
  • 12. SCAPE Datamodel mapping
  • 13. SCAPE Repository API
      <mets:mets ID="scape-entity:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af"
                 OBJID="scape-entity:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af"
                 PROFILE="scape">
        <mets:metsHdr RECORDSTATUS="NEW"/>
        <mets:dmdSec ID="DMD-8c72c14d-475a-49a2-9f43-321732c4e7a2">
          <mets:mdWrap MDTYPE="OTHER">
            <mets:xmlData/>
          </mets:mdWrap>
        </mets:dmdSec>
        <mets:dmdSec ID="DMD-747421f1-fc0d-4c1d-896c-9087d43b5e10">
          <mets:mdWrap MDTYPE="OTHER">
            <mets:xmlData>
              <scape:versionMD version-number="1"/>
            </mets:xmlData>
          </mets:mdWrap>
        </mets:dmdSec>
        <mets:amdSec>
          <mets:techMD ID="TMD-scape-representation:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af-SCAPE_REPRESENTATION_TECHNICAL">
            <mets:mdWrap MDTYPE="OTHER">
              <mets:xmlData/>
            </mets:mdWrap>
          </mets:techMD>
          <mets:techMD ID="TMD-scape-representation:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af-scape-file:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af-JPYLYZER">
            <mets:mdWrap MDTYPE="OTHER">
              <mets:xmlData>
              </mets:xmlData>
            </mets:mdWrap>
          </mets:techMD>
        </mets:amdSec>
        <mets:fileSec>
          <mets:fileGrp>
            <mets:file ID="scape-file:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af"
                       SEQ="0"
                       ADMID="TMD-scape-representation:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af-scape-file:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af-JPYLYZER"
                       MIMETYPE="image/jp2">
              <mets:FLocat xlink:href="http://bitfinder.statsbiblioteket.dk/newspapers/B400022028241-RT1_400022028241-1_1795-06-01_adresseavisen1759-1795-06-01-0006.jp2"
                           xlink:title="B400022028241-RT1_400022028241-1_1795-06-01_adresseavisen1759-1795-06-01-0006.jp2"
                           LOCTYPE="URL"/>
            </mets:file>
          </mets:fileGrp>
        </mets:fileSec>
        <mets:structMap>
          <mets:div TYPE="Intellectual entity">
            <mets:div ID="scape-representation:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af"
                      ADMID="TMD-scape-representation:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af-SCAPE_REPRESENTATION_TECHNICAL"
                      TYPE="Representation"
                      xlink:label="page-image-adresseavisen1759-1795-06-01-0007A">
              <mets:fptr FILEID="scape-file:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af"/>
            </mets:div>
          </mets:div>
        </mets:structMap>
      </mets:mets>
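A worker that receives such a record mainly needs the file location and MIME type from the `mets:fileSec`. The sketch below reads them with Python's standard XML parser. It uses a trimmed copy of the record above and adds the standard METS and XLink namespace declarations, which the slide omits; the function name `list_files` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs; the slide does not show the xmlns declarations.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Trimmed copy of the record above, with namespace declarations added.
RECORD = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink" PROFILE="scape">
  <mets:fileSec>
    <mets:fileGrp>
      <mets:file ID="scape-file:uuid:1c0194a3-c5af-4b40-b140-5ac64cfa43af"
                 MIMETYPE="image/jp2">
        <mets:FLocat xlink:href="http://bitfinder.statsbiblioteket.dk/newspapers/B400022028241-RT1_400022028241-1_1795-06-01_adresseavisen1759-1795-06-01-0006.jp2"
                     LOCTYPE="URL"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def list_files(mets_xml):
    """Return (file ID, MIME type, URL) for every mets:file in the record."""
    root = ET.fromstring(mets_xml)
    href = "{%s}href" % NS["xlink"]  # namespaced attribute name
    files = []
    for file_el in root.iterfind(".//mets:file", NS):
        locat = file_el.find("mets:FLocat", NS)
        files.append((file_el.get("ID"), file_el.get("MIMETYPE"), locat.get(href)))
    return files


for file_id, mime, url in list_files(RECORD):
    print(file_id, mime, url)
```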
  • 14. SCAPE Repository API
      • Get Entity
          – GET /entity/<entityID>
      • Update Entity
          – PUT /entity/<entityID>
      • Create Entity
          – POST /entity/<entityID>
      • And many more
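A client for these calls could be sketched as follows. The paths and HTTP methods are the ones listed on the slide; the base URL, the entity ID, the helper name and the `text/xml` content type are assumptions made for illustration. The requests are only built, not sent, so the example runs without a server.

```python
import urllib.request

BASE_URL = "http://localhost:8080/scape"  # hypothetical service root


def entity_request(method, entity_id, mets_xml=None):
    """Build (but do not send) a request for one entity."""
    data = mets_xml.encode("utf-8") if mets_xml is not None else None
    # Content type is an assumption; check the API specification.
    headers = {"Content-Type": "text/xml"} if data is not None else {}
    return urllib.request.Request(
        f"{BASE_URL}/entity/{entity_id}",
        data=data,
        headers=headers,
        method=method,
    )


get_req = entity_request("GET", "entity-1")               # Get Entity
put_req = entity_request("PUT", "entity-1", "<mets:mets/>")  # Update Entity
print(get_req.get_method(), get_req.full_url)
print(put_req.get_method(), put_req.full_url)
# Sending would be: urllib.request.urlopen(get_req).read()
```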
  • 15. SCAPE Stager/Loader
      • Checkout
          java -jar scape-stager-loader.jar --id_file=identifierFile.txt --checkoutSequenceFile="test.seqfile" checkout
      • Commit
          java -jar scape-stager-loader.jar --commitSequenceFile="test.seqfile" commit
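When the stager/loader is driven from a script, the two calls can be assembled as argument lists. The flags and subcommands below are copied from the slide; the helper names are hypothetical, and the jar is assumed to sit in the working directory. The sketch only prints the commands instead of running them.

```python
import shlex

JAR = "scape-stager-loader.jar"  # assumed to be in the working directory


def checkout_command(id_file, sequence_file):
    """Argument list for the checkout call shown on the slide."""
    return [
        "java", "-jar", JAR,
        f"--id_file={id_file}",
        f"--checkoutSequenceFile={sequence_file}",
        "checkout",
    ]


def commit_command(sequence_file):
    """Argument list for the commit call shown on the slide."""
    return ["java", "-jar", JAR, f"--commitSequenceFile={sequence_file}", "commit"]


print(shlex.join(checkout_command("identifierFile.txt", "test.seqfile")))
print(shlex.join(commit_command("test.seqfile")))
# To execute for real: subprocess.run(checkout_command(...), check=True)
```

Passing a list rather than a single string avoids shell quoting problems if a file name contains spaces.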
  • 16. Step 2: Hadoop reads and updates records
      • The Hadoop job is started with the sequence file as input.
      • For each record in the sequence file:
          – Read the record.
          – Do work.
          – Update the record in the sequence file with the result of the work.
  • 17. Step 3: Store the updated records in DOMS
      • The Hadoop job produces a sequence file.
      • For each record in the sequence file:
          – Read the record into memory.
          – Any changed fields are updated in the corresponding DOMS objects.
      This way, the actual work on the records is not connected to DOMS, and we can scale the cluster independently of the repository.
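The "only changed fields" part of the loading step can be illustrated with a small comparison. In this sketch records are flat dicts and a dict stands in for DOMS; how the real loader compares METS records is not described in the slides, so the functions and field names are hypothetical.

```python
def changed_fields(original, updated):
    """Return the fields of `updated` that are new or differ from `original`."""
    return {
        key: value
        for key, value in updated.items()
        if key not in original or original[key] != value
    }


def load_record(repository, record_id, updated):
    """Apply only the changed fields to the stored object and return them."""
    changes = changed_fields(repository[record_id], updated)
    repository[record_id].update(changes)
    return changes


doms = {"uuid:1": {"mimetype": "image/jp2", "jpylyzer": None}}
result = {"mimetype": "image/jp2", "jpylyzer": "<jpylyzer/>"}
print(load_record(doms, "uuid:1", result))
```

Unchanged fields cause no write at all, which keeps the load on the single-headed repository proportional to what the Hadoop job actually changed.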