Content Profiling and C3PO
Artur Kulmukhametov
Vienna University of Technology
SCAPE PW Training Event
Aarhus, 13-14 November 2013
Agenda

• Motivation: collection scale and heterogeneity
• An approach to gaining control
• Characterisation tools
• C3PO, a tool for content profiling

This work was partially supported by the SCAPE Project.
The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).

What is it?
*

* - P. Petrov, Content Profiling and Planning, SCAPE Training Event. Guimarães, 2012

Large Synoptic Survey Telescope
30 terabytes of data nightly

Variety of Data

• Personal
• Cultural Heritage
• Scientific Data
• Government Documents
• … a huge variety of formats and information

Conclusions?

… that’s a lot of data …
Do you know what that data is?
Do you want to do something with it?

Place for Characterization

Characterization
One size does not fit all!

Scalability

Tools for Characterization

• fido
• Exif
• jpylyzer
• ffident
• Exiftool
• Droid

A few Problems…
• A lot of tools to manage and invoke
• Different output schemas
• Different configuration/environments
• Little or no higher-level management
• Difficult to spot differences

File Information Tool Set

• FITS is a software tool designed to identify, validate, and extract technical metadata for various file formats
• By Harvard University Library in 2009
• v0.6.2, LGPL
• Wraps other tools
• New version every 6-12 months

File Information Tool Set
Main features:
• Consolidates output
• Can include raw output
• Configurable/Extendable

FITS includes:
• Droid
• Metadata Extractor
• Jhove
• Exiftool
• FFident
• File Utility

http://code.google.com/p/fits/

FITS Output
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/fits/fits_output http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd"
      version="0.6.0" timestamp="12/27/11 10:49 AM">
  <identification>
    <identity format="Portable Document Format" mimetype="application/pdf" toolname="FITS" toolversion="0.6.0">
      <tool toolname="Jhove" toolversion="1.5" />
      <tool toolname="file utility" toolversion="5.03" />
      <tool toolname="Exiftool" toolversion="7.74" />
      <tool toolname="NLNZ Metadata Extractor" toolversion="3.4GA" />
      <tool toolname="ffident" toolversion="0.2" />
      <version toolname="Jhove" toolversion="1.5">1.4</version>
      <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">fmt/18</externalIdentifier>
    </identity>
  </identification>
  <fileinfo>
    <size toolname="Jhove" toolversion="1.5">39586</size>
    <creatingApplicationName toolname="NLNZ Metadata Extractor" toolversion="3.4GA" status="SINGLE_RESULT">/XPP</creatingApplicationName>
    <lastmodified toolname="Exiftool" toolversion="7.74" status="SINGLE_RESULT">2011:12:27 10:44:28+01:00</lastmodified>
    <created toolname="Exiftool" toolversion="7.74" status="SINGLE_RESULT">2002:04:25 13:02:24Z</created>
    <filepath toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">/home/petrov/taverna/tmp/000/000009.pdf</filepath>
    …
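As a rough illustration of how such FITS output could be read programmatically, the sketch below pulls a few properties out of one document with Python's standard library. The sample is abridged from the slide and its closing tags were added by hand so that it parses; the function name summarize_fits and the choice of fields are assumptions made for this example, not part of FITS or C3PO.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of the FITS output on the slide.
FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Abridged sample; closing tags added by hand (the slide excerpt is truncated).
SAMPLE = """<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" version="0.6.0">
  <identification>
    <identity format="Portable Document Format" mimetype="application/pdf"
              toolname="FITS" toolversion="0.6.0">
      <tool toolname="Jhove" toolversion="1.5" />
      <version toolname="Jhove" toolversion="1.5">1.4</version>
      <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">fmt/18</externalIdentifier>
    </identity>
  </identification>
  <fileinfo>
    <size toolname="Jhove" toolversion="1.5">39586</size>
  </fileinfo>
</fits>"""


def summarize_fits(xml_text):
    """Return a flat dict with a few properties of one FITS document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    identity = root.find("fits:identification/fits:identity", FITS_NS)
    puid = identity.find("fits:externalIdentifier[@type='puid']", FITS_NS)
    size = root.find("fits:fileinfo/fits:size", FITS_NS)
    return {
        "format": identity.get("format"),
        "mimetype": identity.get("mimetype"),
        "format_version": identity.findtext("fits:version", namespaces=FITS_NS),
        "puid": puid.text if puid is not None else None,
        "size": int(size.text) if size is not None else None,
    }


summary = summarize_fits(SAMPLE)
print(summary)
```

A flat record like this is the kind of per-file input that a profiling step can aggregate over a whole collection.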

FITS Output Conflict
<?xml version="1.0" encoding="UTF-8"?>
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/fits/fits_output http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd"
      version="0.6.1" timestamp="7/21/12 3:51 PM">
  <identification status="CONFLICT">
    <identity format="Plain text" mimetype="text/plain" toolname="FITS" toolversion="0.6.1">
      <tool toolname="Jhove" toolversion="1.5" />
    </identity>
    <identity format="Rich Text Format" mimetype="application/rtf, text/rtf" toolname="FITS" toolversion="0.6.1">
      <tool toolname="Droid" toolversion="3.0" />
      <version toolname="Droid" toolversion="3.0" status="CONFLICT">1.5</version>
      <version toolname="Droid" toolversion="3.0" status="CONFLICT">1.6</version>
      <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">fmt/50</externalIdentifier>
      <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">fmt/51</externalIdentifier>
    </identity>
    <identity format="Rich Text Format" mimetype="text/rtf" toolname="FITS" toolversion="0.6.1">
      <tool toolname="ffident" toolversion="0.2" />
    </identity>
  </identification>
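One way to spot such a conflict automatically is to check the status attribute of the identification element and list which tools back each competing format. The following is a minimal sketch under that assumption; the sample is abridged from the slide (with a closing tag added so it parses), and competing_identities is a hypothetical helper, not a FITS or C3PO function.

```python
import xml.etree.ElementTree as ET

NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Abridged from the slide; </fits> added by hand so the sample is well-formed.
SAMPLE = """<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" version="0.6.1">
  <identification status="CONFLICT">
    <identity format="Plain text" mimetype="text/plain" toolname="FITS" toolversion="0.6.1">
      <tool toolname="Jhove" toolversion="1.5" />
    </identity>
    <identity format="Rich Text Format" mimetype="application/rtf, text/rtf" toolname="FITS" toolversion="0.6.1">
      <tool toolname="Droid" toolversion="3.0" />
    </identity>
    <identity format="Rich Text Format" mimetype="text/rtf" toolname="FITS" toolversion="0.6.1">
      <tool toolname="ffident" toolversion="0.2" />
    </identity>
  </identification>
</fits>"""


def competing_identities(xml_text):
    """Map each reported format to the tools that reported it, if FITS flagged a conflict."""
    root = ET.fromstring(xml_text)
    identification = root.find("fits:identification", NS)
    if identification is None or identification.get("status") != "CONFLICT":
        return {}  # no conflict flagged for this file
    votes = {}
    for identity in identification.findall("fits:identity", NS):
        tools = [tool.get("toolname") for tool in identity.findall("fits:tool", NS)]
        votes.setdefault(identity.get("format"), []).extend(tools)
    return votes


votes = competing_identities(SAMPLE)
for fmt, tools in votes.items():
    print(f"{fmt}: {', '.join(tools)}")
```

Here Jhove alone reports plain text while Droid and ffident both report Rich Text Format; how to resolve the disagreement is a policy decision that the sketch deliberately leaves open.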

Conflicts
Three types of conflict:
1. Inconsistent property naming, e.g. image_width and imagewidth
2. Competing characterisation results, e.g. tool1 identifies a file as plain text, but tool2 identifies the file as PDF
3. Close, but not identical, property values, e.g. application/xhtml+xml vs. application/xml
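Conflicts of the first type can often be surfaced with a very simple heuristic: reduce every property name to a canonical key and group the names that collide. The sketch below assumes that lower-casing and dropping separators is enough; it is a naive illustration, not the approach C3PO takes, and the name ImageWidth is a made-up third variant added for the example.

```python
import re
from collections import defaultdict


def canonical_key(name):
    """Lower-case the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def group_property_names(names):
    """Return canonical keys that are shared by more than one spelling."""
    groups = defaultdict(set)
    for name in names:
        groups[canonical_key(name)].add(name)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# image_width and imagewidth come from the slide; ImageWidth is hypothetical.
observed = ["image_width", "imagewidth", "ImageWidth", "mimetype", "size"]
clashes = group_property_names(observed)
print(clashes)
```

A heuristic like this only finds spelling variants; names that differ semantically (for example width vs. x_dimension) need a curated mapping.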

Yet Another?
Advantages
• All-in-one
• Unified output schema
• Broad type coverage
Disadvantages
• Consolidation is hard
• Low performance: runs all the tools on every file
• Conflicts

Content Profiling
• Global View of Content
• Distribution of characteristics
• Statistics (size, min, max, …)
• Sampling
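To make the idea of a content profile concrete, here is a minimal sketch that aggregates per-file records into a distribution of one characteristic plus basic size statistics. The record layout and the function build_profile are assumptions for illustration; only the first record (format and size) is taken from the FITS example in this deck, the others are invented.

```python
from collections import Counter

# Per-file records; the first mirrors the FITS sample, the rest are hypothetical.
records = [
    {"format": "Portable Document Format", "size": 39586},
    {"format": "Portable Document Format", "size": 5000},
    {"format": "Plain text", "size": 1200},
    {"format": "Rich Text Format", "size": 800},
]


def build_profile(records, prop="format"):
    """Aggregate records into a small profile: count, size statistics, distribution."""
    sizes = [record["size"] for record in records]
    return {
        "count": len(records),
        "size": {
            "min": min(sizes),
            "max": max(sizes),
            "sum": sum(sizes),
            "avg": sum(sizes) / len(sizes),
        },
        "distribution": dict(Counter(record.get(prop, "unknown") for record in records)),
    }


profile = build_profile(records)
print(profile)
```

The same function can be called with another property name, such as a MIME type, to view the collection along a different characteristic.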

Representative Sampling
*

• Based upon metadata
• Outlier identification
• As few as possible, as many as necessary
• Stratification across file type, size, time or any other relevant characteristic for the use case
* - E. Poltorak, Representative sampling, Flickr, http://www.flickr.com/photos/44461316@N08/4110321514/, 2009
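Stratification can be sketched as follows: group the records by the chosen characteristic, then draw a proportional share from every group while keeping at least one item per group so that rare types are not lost. This is a generic stratified-sampling sketch, not C3PO's sampler; the function name, the fixed seed, and the example collection are assumptions.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of each stratum, but never fewer than one record."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical collection: 90 PDF, 9 plain-text and 1 RTF file.
collection = (
    [{"id": i, "format": "pdf"} for i in range(90)]
    + [{"id": 90 + i, "format": "txt"} for i in range(9)]
    + [{"id": 99, "format": "rtf"}]
)
sample = stratified_sample(collection, key=lambda r: r["format"], fraction=0.1)
print(Counter(r["format"] for r in sample))
```

The "at least one per stratum" rule is what keeps the single RTF file in the sample; a plain random 10% draw would usually miss it.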

Clever, Crafty Content Profiling of Objects
C3PO is a tool for content profile generation.
• Uses characterization results
• Deeper content analysis with nice visuals through the web-app
• Generates content profiles (map/reduce)

Sometimes, I don’t understand human behavior?!
http://github.com/openplanets/c3po
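The map/reduce pattern mentioned above can be illustrated in a few lines: each chunk of records is mapped to a partial histogram, and the partial histograms are then reduced into one. The slides do not show C3PO's actual jobs, so this is a pure-Python sketch of the pattern only; the chunking and the record layout are assumptions.

```python
from collections import Counter
from functools import reduce


def map_chunk(chunk):
    """Map step: count MIME types within one chunk of records."""
    return Counter(record["mimetype"] for record in chunk)


def reduce_counts(left, right):
    """Reduce step: merge two partial histograms."""
    return left + right


# Two hypothetical chunks, e.g. records processed by two different workers.
chunks = [
    [{"mimetype": "application/pdf"}, {"mimetype": "text/plain"}],
    [{"mimetype": "application/pdf"}, {"mimetype": "text/rtf"}],
]
histogram = reduce(reduce_counts, map(map_chunk, chunks), Counter())
print(histogram.most_common())
```

Because merging histograms is associative, the chunks can be processed independently and combined in any order, which is what makes the pattern suitable for large collections.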

Clever, Crafty Content Profiling of Objects
• CLI-app
  • Parses and processes FITS and Apache Tika files
  • Stores data in MongoDB
  • Output: XML profile + CSV
  • Supports new adaptors

• Web-app
  • Overview and browsing
  • Filtering
  • Representative sample set generation
  • REST API (Scout)

C3PO: Representative Samples
• Size'o'Matic 3000
• DistSampler
• SysSampler

* - Statistical Consultants Ltd, http://www.statisticalconsultants.co.nz/weeklyfeatures/WF7.html, 2013
** - D. Lane, Online Statistics Education, http://onlinestatbook.com/2/sampling_distributions/samp_dist_mean.html, 2013

C3PO: Performance
• CPU: 2.3 GHz 2-core, RAM: 4 GB, HDD
• CLI + Web-app
• Govdocs1
  • 945699 FITS files
  • ingest: 1h 48m
  • profile: 12 minutes
  • 112 different object properties
• Internet Memory Web Archive Data
  • 958638 FITS files
  • ingest: 2h 58m
  • profile: 13.5 minutes
  • 105 different object properties

C3PO: Performance
• CPU: 2.3 GHz 2-core, RAM: 4 GB, HDD
• CLI + noDB adaptor (not publicly available yet)
• SB (Denmark) dataset: 12 TB of data
  • 563M FITS files
  • no ingest
  • profile: 49 hours
  • 5314 different object properties
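To compare the reported runs, the profiling times can be converted into files per second. The short calculation below uses only the figures quoted on the performance slides and reads "563M" as 563,000,000 files, which is an assumption; ingest time for the first two runs is not included.

```python
# (dataset, number of FITS files, profiling time in seconds), as quoted on the slides.
runs = [
    ("Govdocs1", 945_699, 12 * 60),
    ("Internet Memory Web Archive Data", 958_638, 13.5 * 60),
    ("SB (Denmark)", 563_000_000, 49 * 3600),  # "563M" read as 563,000,000 (assumption)
]


def files_per_second(file_count, seconds):
    """Average profiling throughput for one run."""
    return file_count / seconds


rates = {name: files_per_second(count, seconds) for name, count, seconds in runs}
for name, rate in rates.items():
    print(f"{name}: about {rate:,.0f} files/s")
```

The runs differ in more than size (database-backed vs. noDB adaptor, number of distinct properties), so the rates are indicative rather than a like-for-like benchmark.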

C3PO: Roadmap
• Conflict reduction
  • Conflicts of type 2 are solved
• Use the PW ontology for alignment with other tools
  • Consistent naming of properties, values, measures
  • The ontology will solve conflicts of type 1
• Data Connector API
  • A common interface to interact with repositories

Summary

• Characterization is time consuming
• It can be faulty
• Know your tools
• A tool for content profiling? C3PO!


Content profiling and C3PO

  • 1. Content Profiling and C3PO Artur Kulmukhametov Vienna University of Technology SCAPE PW Training Event Aarhus, 13-14 November 2013
  • 2. Agenda • Motivation: collection scale and heterogeneity • An approach to getting a control • Characterisation tools • C3PO, a tool for content profiling This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 2
  • 3. What is it? * * - P. Petrov, Content Profiling and Planning, SCAPE Training Event. Guimarães, 2012 This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 3
  • 4. Large Synoptic Survey Telescope * * - P. Petrov, Content Profiling and Planning, SCAPE Training Event. Guimarães, 2012 This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 30 Terabytes of data nightly 4
  • 5. Variety of Data • Personal • Cultural Heritage • Scientific Data • Government Documents • …. a huge variety of formats and information This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 5
  • 6. * * - P. Petrov, Content Profiling and Planning, SCAPE Training Event. Guimarães, 2012 This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 6
  • 7. Conclusions? ….. that’s a lot of data …… Do you know what that data is? Do you want to do something with it? This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 7
  • 8. Place for Characterization * * - P. Petrov, Content Profiling and Planning, SCAPE Training Event. Guimarães, 2012 This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 8
  • 9. Characterization * * - P. Petrov, Content Profiling and Planning, SCAPE Training Event. Guimarães, 2012 This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 9
  • 10. Characterization * * - P. Petrov, Content Profiling and Planning, SCAPE Training Event. Guimarães, 2012 This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 10
  • 11. Characterization * ! One size does not fit all ! * - P. Petrov, Content Profiling and Planning, SCAPE Training Event. Guimarães, 2012 This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 11
  • 12. Scalability * * - P. Petrov, Content Profiling and Planning, SCAPE Training Event. Guimarães, 2012 This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 12
  • 13. Tools for Characterization fido Exif jpylyzer ffident Exiftool Droid This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 13
  • 14. A few Problems… • A lot of tools to manage and invoke • Different output schemas • Different configuration/environments • No or bad higher level management • Difficult to spot differences This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 14
  • 15. File Information Tool Set • FITS is a software designed to identify, validate, and extract technical metadata for various file formats • By Harvard University Library in 2009 • v0.6.2, LGPL • Wraps other tools • New version every 6-12 months This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 15
  • 16. File Information Tool Set • Main features: consolidates output; can include raw output; configurable/extendable • FITS includes: Droid, Metadata Extractor, Jhove, Exiftool, FFident, File Utility • http://code.google.com/p/fits/
  • 17. FITS Output (excerpt)
      <fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/fits/fits_output http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd"
            version="0.6.0" timestamp="12/27/11 10:49 AM">
        <identification>
          <identity format="Portable Document Format" mimetype="application/pdf" toolname="FITS" toolversion="0.6.0">
            <tool toolname="Jhove" toolversion="1.5" />
            <tool toolname="file utility" toolversion="5.03" />
            <tool toolname="Exiftool" toolversion="7.74" />
            <tool toolname="NLNZ Metadata Extractor" toolversion="3.4GA" />
            <tool toolname="ffident" toolversion="0.2" />
            <version toolname="Jhove" toolversion="1.5">1.4</version>
            <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">fmt/18</externalIdentifier>
          </identity>
        </identification>
        <fileinfo>
          <size toolname="Jhove" toolversion="1.5">39586</size>
          <creatingApplicationName toolname="NLNZ Metadata Extractor" toolversion="3.4GA" status="SINGLE_RESULT">/XPP</creatingApplicationName>
          <lastmodified toolname="Exiftool" toolversion="7.74" status="SINGLE_RESULT">2011:12:27 10:44:28+01:00</lastmodified>
          <created toolname="Exiftool" toolversion="7.74" status="SINGLE_RESULT">2002:04:25 13:02:24Z</created>
          <filepath toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">/home/petrov/taverna/tmp/000/000009.pdf</filepath>
  • 18. FITS Output Conflict (excerpt)
      <?xml version="1.0" encoding="UTF-8"?>
      <fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/fits/fits_output http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd"
            version="0.6.1" timestamp="7/21/12 3:51 PM">
        <identification status="CONFLICT">
          <identity format="Plain text" mimetype="text/plain" toolname="FITS" toolversion="0.6.1">
            <tool toolname="Jhove" toolversion="1.5" />
          </identity>
          <identity format="Rich Text Format" mimetype="application/rtf, text/rtf" toolname="FITS" toolversion="0.6.1">
            <tool toolname="Droid" toolversion="3.0" />
            <version toolname="Droid" toolversion="3.0" status="CONFLICT">1.5</version>
            <version toolname="Droid" toolversion="3.0" status="CONFLICT">1.6</version>
            <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">fmt/50</externalIdentifier>
            <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">fmt/51</externalIdentifier>
          </identity>
          <identity format="Rich Text Format" mimetype="text/rtf" toolname="FITS" toolversion="0.6.1">
            <tool toolname="ffident" toolversion="0.2" />
          </identity>
        </identification>
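A conflict like the one on this slide can be spotted programmatically. The following is a minimal sketch, not part of FITS or C3PO: it assumes the element and attribute names shown on the slide, uses a trimmed sample with closing tags added so that it parses, and the function name `identification_report` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the FITS output shown on the slide.
FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Trimmed sample modelled on the slide; closing tags were added so it parses.
SAMPLE = """<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" version="0.6.1">
  <identification status="CONFLICT">
    <identity format="Plain text" mimetype="text/plain" toolname="FITS" toolversion="0.6.1">
      <tool toolname="Jhove" toolversion="1.5" />
    </identity>
    <identity format="Rich Text Format" mimetype="application/rtf, text/rtf" toolname="FITS" toolversion="0.6.1">
      <tool toolname="Droid" toolversion="3.0" />
    </identity>
    <identity format="Rich Text Format" mimetype="text/rtf" toolname="FITS" toolversion="0.6.1">
      <tool toolname="ffident" toolversion="0.2" />
    </identity>
  </identification>
</fits>
"""


def identification_report(xml_text):
    """Return (status, [(format, [tool names]), ...]) for one FITS document."""
    root = ET.fromstring(xml_text)
    ident = root.find("fits:identification", FITS_NS)
    status = ident.get("status")  # None when the attribute is absent
    claims = []
    for identity in ident.findall("fits:identity", FITS_NS):
        tools = [t.get("toolname") for t in identity.findall("fits:tool", FITS_NS)]
        claims.append((identity.get("format"), tools))
    return status, claims


status, claims = identification_report(SAMPLE)
for fmt, tools in claims:
    print(status, fmt, tools)
```

Listing which tool made which claim is usually the first step in deciding whom to trust for a given file.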
  • 19. Conflicts: 3 types of conflicts: 1. Inconsistent property naming, e.g. image_width and imagewidth 2. Competing characterisation results, e.g. tool1 identifies a file as plain text, but tool2 identifies the same file as PDF 3. Close, but not identical property values, e.g. application/xhtml+xml vs. application/xml
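The first two conflict types can be illustrated with a small sketch. It is an assumption-laden illustration, not how FITS or C3PO resolve conflicts: the helper names `normalize_name` and `find_conflicts` and the tuple layout are hypothetical. Normalising property names removes type 1 conflicts, after which any remaining disagreement between tools is a type 2 conflict.

```python
from collections import defaultdict


def normalize_name(name):
    """Lower-case and drop separators so 'image_width' and 'imagewidth' match."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def find_conflicts(results):
    """results: iterable of (tool, property_name, value) tuples.

    Returns {normalized_property: {tool: value}} for every property on
    which the tools report different values.
    """
    values = defaultdict(dict)
    for tool, prop, value in results:
        values[normalize_name(prop)][tool] = value
    return {
        prop: by_tool
        for prop, by_tool in values.items()
        if len(set(by_tool.values())) > 1
    }


results = [
    ("tool1", "image_width", "800"),   # type 1: naming differs, value agrees
    ("tool2", "imagewidth", "800"),
    ("tool1", "format", "Plain text"),  # type 2: competing results
    ("tool2", "format", "PDF"),
]
conflicts = find_conflicts(results)
print(conflicts)
```

Type 3 conflicts (close but different values) need domain knowledge, such as a mapping between related MIME types, and are not covered by this sketch.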
  • 20. Yet Another? • Advantages: all-in-one; unified output schema; broad type coverage • Disadvantages: consolidation is hard; low performance (runs all the tools on every file); conflicts
  • 21. Content Profiling • Global view of content • Distribution of characteristics • Statistics (size, min, max, …) • Sampling (figure: P. Petrov, Content Profiling and Planning, SCAPE Training Event, Guimarães, 2012)
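What such a profile contains can be sketched in a few lines. This is an illustration only, assuming each object has already been reduced to a record with `format` and `size` fields; the record layout, the sample values other than the 39586-byte PDF from the FITS slide, and the function name `profile` are hypothetical.

```python
from collections import Counter
from statistics import mean


def profile(records):
    """Summarise a collection: object count, format distribution, size statistics."""
    sizes = [r["size"] for r in records]
    return {
        "count": len(records),
        "formats": Counter(r["format"] for r in records),
        "size": {
            "min": min(sizes),
            "max": max(sizes),
            "avg": mean(sizes),
            "sum": sum(sizes),
        },
    }


records = [
    {"format": "Portable Document Format", "size": 39586},
    {"format": "Portable Document Format", "size": 1214},
    {"format": "Plain text", "size": 200},
]
p = profile(records)
print(p["formats"].most_common(), p["size"])
```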
  • 22. Representative Sampling • Based upon metadata • Outlier identification • As few as possible, as many as necessary • Stratification across file type, size, time or any other characteristic relevant to the use case (figure: E. Poltorak, Representative sampling, Flickr, http://www.flickr.com/photos/44461316@N08/4110321514/, 2009)
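Stratification can be sketched as follows. This is a generic stratified sampler written for illustration, not C3PO's sampling code; the function name `stratified_sample`, its parameters, and the record layout are hypothetical. Records are grouped by a chosen characteristic and a fixed number is drawn from each group, so rare strata are still represented.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, per_stratum, seed=0):
    """Draw up to `per_stratum` records from each stratum defined by `key`."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for name in sorted(strata):
        group = strata[name]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


records = [{"id": i, "format": "pdf"} for i in range(10)]
records.append({"id": 10, "format": "txt"})  # a rare format

sample = stratified_sample(records, key=lambda r: r["format"], per_stratum=2)
print(Counter(r["format"] for r in sample))
```

A simple random sample of three from these eleven records would miss the single text file most of the time; the stratified draw always includes it.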
  • 23. Clever, Crafty Content Profiling of Objects • C3PO is a tool for content profile generation • Uses characterization results • Deeper content analysis with visualizations through the web-app • Generates content profiles (map/reduce) • http://github.com/openplanets/c3po • Figure caption: "Sometimes, I don’t understand human behavior?!" (figure: P. Petrov, Content Profiling and Planning, SCAPE Training Event, Guimarães, 2012)
  • 24. Clever, Crafty Content Profiling of Objects • CLI-app: parses and processes FITS and Apache Tika files; stores data in MongoDB; output: XML profile + CSV; supports new adaptors • Web-app: overview and browsing; filtering; representative sample set generation; REST API (Scout)
  • 25. C3PO: Representative Samples • Size'o'Matic 3000 • DistSampler ** • SysSampler * • * Statistical Consultants Ltd, http://www.statisticalconsultants.co.nz/weeklyfeatures/WF7.html, 2013 • ** D. Lane, Online Statistics Education, http://onlinestatbook.com/2/sampling_distributions/samp_dist_mean.html, 2013
  • 26. C3PO: Performance • Test machine: CPU 2.3 GHz, 2 cores; RAM 4 GB; HDD • Setup: CLI + Web-app • Govdocs1: 945699 FITS files; ingest 1 h 48 min; profile 12 min; 112 different object properties • Internet Memory Web Archive Data: 958638 FITS files; ingest 2 h 58 min; profile 13.5 min; 105 different object properties
  • 27. C3PO: Performance • Test machine: CPU 2.3 GHz, 2 cores; RAM 4 GB; HDD • Setup: CLI + noDB adaptor (not publicly available yet) • SB (Denmark) dataset, 12 TB of data: 563M FITS files; no ingest; profile 49 hours; 5314 different object properties
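The timings on the two performance slides are easier to compare as throughput. The sketch below only does the arithmetic on the slide figures taken at face value; it assumes "563M" means 563 million files, and the helper `files_per_second` and the run labels are hypothetical. The runs use different setups and datasets, so the rates are indicative rather than a strict benchmark.

```python
def files_per_second(files, hours=0, minutes=0):
    """Average throughput for a run that processed `files` in the given time."""
    return files / (hours * 3600 + minutes * 60)


runs = {
    "Govdocs1 ingest": files_per_second(945_699, hours=1, minutes=48),
    "Govdocs1 profile": files_per_second(945_699, minutes=12),
    "Web archive ingest": files_per_second(958_638, hours=2, minutes=58),
    "Web archive profile": files_per_second(958_638, minutes=13.5),
    # Assumption: "563M" on the slide is read as 563 million files.
    "SB noDB profile": files_per_second(563_000_000, hours=49),
}

for name, rate in runs.items():
    print(f"{name}: {rate:.1f} files/s")
```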
  • 28. C3PO: Roadmap • Conflict reduction: conflicts of type 2 are solved • Use the PW ontology for an alignment with other tools: consistent naming of properties, values, measures; the ontology will solve conflicts of type 1 • Data Connector API: a common interface to interact with repositories
  • 29. Summary • Characterization is time consuming • It can be faulty • Know your tools • A tool for content profiling? C3PO!