SlideShare una empresa de Scribd logo
1 de 16
Descargar para leer sin conexión
Active Data: Data Life Cycle Management
Across Heterogeneous Systems and
Infrastructures
Anthony Simonet, Gilles Fedak (INRIA)
Matei Ripeanu, Samer Al-Kiswany (UCB)
Kyle Chard, Ian Foster (ANL/UC)
Hot Topics in High-Performance Distributed Computing Workshop
IBM Almaden Research Center
San Jose, California
March 12, 2015
1/13
G. Fedak() Active Data March 12, 2015
Big Data ...
Huge and growing volume of information originating from multiple
sources.
!"#$%&'("$#) *+'&,-./"#)0+1)*2+("2() 34(")5-$-)!"$(%"($)
. . . or Big Bottlenecks ?
how to scale the infrastructure ?
end-to-end performance improvement, inter-system optimization.
how to improve productivity of data-intensive scientist ?
data-oriented programming language, data quality, improve
automation and errors recovery
2/13
G. Fedak() Active Data March 12, 2015
Data Life Cycle
Definition
Data Life Cycle (DLC) is the course of operational stages through which
data pass from the time when they enter a set of systems to the time
when they leave it.
!"#$%&%'()* +,-.,("-&&%)/* 01(,2/-*
!)234&%&*
!)234&%&*
Challenges :
Expose high level view DLC across distributed systems and
infrastructures
Expose interactions between the infrastructure and the DLC (e.g
failures)
3/13
G. Fedak() Active Data March 12, 2015
Active Data
Active Data:
Allow to reason about data sets handled by heterogeneous software
and infrastructures.
A formal model that captures the essential life cycle stages and
properties: creation, deletion, faults, replication, error checking . . .
programming model to develop easily data life cycle management
applications.
Allows legacy systems to expose their intrinsic data life cycle.
4/13
G. Fedak() Active Data March 12, 2015
Active Data: Principles & Features
System programmers expose their system’s internal data life cycle with a
model based on Petri Nets.
A Life Cycle Model is made of
Places: data states
Transitions : data operations
•
Created
t1
Written
t2
Read
t3
t4
Terminated
Each token has a unique identifier, corresponding to the actual data
item’s.
5/13
G. Fedak() Active Data March 12, 2015
Active Data: Principles & Features
System programmers expose their system’s internal data life cycle with a
model based on Petri Nets.
A Life Cycle Model is made of
Places: data states
Transitions : data operations
Created
t1
•
Written
t2
Read
t3
t4
Terminated
A transition is fired whenever a data state changes.
5/13
G. Fedak() Active Data March 12, 2015
Active Data: Principles & Features
System programmers expose their system’s internal data life cycle with a
model based on Petri Nets.
A Life Cycle Model is made of
Places: data states
Transitions : data operations
Created
t1
•
Written
t2
Read
t3
t4
Terminated
public void handler () {
computeMD5 ();
}
Code may be plugged by clients to transitions.
It is executed whenever the transition is fired.
5/13
G. Fedak() Active Data March 12, 2015
Active Data Framework
6/13
G. Fedak() Active Data March 12, 2015
Life Cycle View
File transferFile Dataset Metadata
Guard
Code Execution
}Tagged Tokens
Notification
Framework features:
Captures data events in legacy systems
High-level life cycle-centered view of data
Single namespace for all the files,
datasets and metadata
Powerful filters based on Data Tags
Install Taggers on Transitions
Guarded Transitions : only executes on
token which have specific tags.
Publish/subscribe transitions
Custom user reaction to data progress
Custom code execution
Custom notifications (twitter, email,
gdoc, ifttt . . . )
Use Case: Advanced Photon Source
Globus Catalog Globus
Detector Local Storage Compute Cluster
1. Local
Transfer
2. Extract
Metadata
3. Globus
Transfer
4. Swift Parallel Analysis
3 to 5 TB of data per week on this detector
Raw data are pre-processed and registered in the Globus Catalog :
Data are curated by several applications
Data are shared amongst scientific user
7/13
G. Fedak() Active Data March 12, 2015
Data Surveillance Framework
4 goals (that would otherwise require a lot of scripting and hacking):
Monitoring Data Set Progress
Better Automation
Sharing & Notification
Error Discovery & Recovery
8/13
G. Fedak() Active Data March 12, 2015
APS Data Life Cycle Model
Created Start transfer
Terminated
End
Detector
Created
SuccessFailure
SucceededFailed
EndEnd
Terminated
End transfer
Globus transfer
Created
End
Terminated
Start transfer
Shared storage
Created
SuccessFailure
SucceededFailed
EndEnd
Terminated
Start Swift
Globus transfer
CreatedExtract
Update
TerminatedRemove
Globus Catalog
Created
Initialize
Set
End
Failure
Terminated
Derive
Swift
Data life cycle model composed of 6 systems.
9/13
G. Fedak() Active Data March 12, 2015
Error Detection & Recovery
10/13
G. Fedak() Active Data March 12, 2015
Example scenario
Recover from system-wide errors: faulty acquired files are detected only
after Swift fails to process them.
In this situation, the user manually:
Drops the whole dataset
Removes any associated file and metadata
Re-acquire the dataset using the same parameters
E.D.&R. implementationAvalon Daniel Arnaud Anthony Vincent
Use-case: APS data life cycle model
Created Start transfer
Terminated
End
Detector
Created
SuccessFailure
SucceededFailed
EndEnd
Terminated
End transfer
Globus transfer
Created
End
Terminated
Start transfer
Shared storage
Created
SuccessFailure
SucceededFailed
EndEnd
Terminated
Start Swift
Globus transfer
CreatedExtract
Update
TerminatedRemove
Globus Catalog
Created
Initialize
Set
End
Failure
Terminated
Derive
Swift
Data life cycle model composed of 6 systems.
Avalon June 3rd, 2014 22/30
Active Data Client
Failure
and
Recovery
Handler
remove metadata from the catalog
Filters
data
likely to
fail
Tagger
Active Data Client
Guard
Handler
run the Globus
catalog UI scripts
11/13
G. Fedak() Active Data March 12, 2015
Handler Code
TransitionHandler handler = new TransitionHandler () {
public void handler(Transition t, boolean isLocal , Token [] inTokens , Token [] outTokens) {
// Get the dataset identifier
LifeCycle lc = ad. getLifeCycle (inTokens [0]);
datasetId = lc.getTokens("Shared storage.Created")[0]. getUid ();
// Remove the dataset annotations from the catalog
String url = "https :// catalog.globus.org/dataset/" + datasetId;
Runtime r = Runtime.getRuntime ();
Process p = r.exec(" catalog_client .py remove " + url);
p.waitFor ();
// Locally , remove the datasets
String path = "~/aps/" + datasetId;
FileUtils. deleteDirectory (new File(path));
// Publish the " Detector .End"
Token root = lc.getTokens("Detector.Created")[0];
ad. publishTransition ("Detector.End", lc);
// Notify the user
sendEmail("user@server.com", "APS - Corrupted dataset " + datasetId);
}
};
HandlerGuard guard = new HandlerGuard () {
public boolean accept ( Transition t , Token [] inTokens , Token [] outTokens ) {
return input [0]. hasTag(" f a i l u r e c o r r u p t e d ");
}}
ad.subscribeTo("Swift.Failure", handler , guard);
12/13
G. Fedak() Active Data March 12, 2015
Conclusion
Active Data
allows to expose Data Life Cycle across heterogeneous systems and
infrastructures
transition-based programming model for DLC management
application
Monitoring, automation, error detection & recovery
X-systems optimizations: incremental computing, data staging,
caching, throttling etc. . .
Perspectives :
Use AD to deploy data management software stack on IaaS (Asma
Ben Cheick, Heithem Abbes, Univ. Tunis)
Big Data Apache stack X-optimization (H. He, CAS, Beijing)
Volunteer & crowd computing (M. Moca, BBU, Romania)
13/13
G. Fedak() Active Data March 12, 2015
Thank you!
Questions?

Más contenido relacionado

La actualidad más candente

SnowCamp - Adding search to a legacy application
SnowCamp - Adding search to a legacy applicationSnowCamp - Adding search to a legacy application
SnowCamp - Adding search to a legacy applicationNicolas Fränkel
 
Making data typing efforts or automatically detecting data types for automat...
Making data typing efforts or automatically detecting data types  for automat...Making data typing efforts or automatically detecting data types  for automat...
Making data typing efforts or automatically detecting data types for automat...National Institute of Informatics
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01David Smiley
 
“Open Data Web” – A Linked Open Data Repository Built with CKAN
“Open Data Web” – A Linked Open Data Repository Built with CKAN“Open Data Web” – A Linked Open Data Repository Built with CKAN
“Open Data Web” – A Linked Open Data Repository Built with CKANChengjen Lee
 
Revamp the tablespace reorg process with ibm db2 automation tool
Revamp the tablespace reorg process with ibm db2 automation toolRevamp the tablespace reorg process with ibm db2 automation tool
Revamp the tablespace reorg process with ibm db2 automation toolBharath Nunepalli
 
Redirected Recovery of Recovery Expert for DB2 on z/OS
Redirected Recovery of Recovery Expert for DB2 on z/OSRedirected Recovery of Recovery Expert for DB2 on z/OS
Redirected Recovery of Recovery Expert for DB2 on z/OSBharath Nunepalli
 
balloon: LOD forecasting - cloudy with a chance of services
balloon: LOD forecasting - cloudy with a chance of servicesballoon: LOD forecasting - cloudy with a chance of services
balloon: LOD forecasting - cloudy with a chance of servicesKai Schlegel
 
balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference Information
balloon Fusion: SPARQL Rewriting Based on  Unified Co-Reference Informationballoon Fusion: SPARQL Rewriting Based on  Unified Co-Reference Information
balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference InformationKai Schlegel
 
Introduction to R2DBC
Introduction to R2DBCIntroduction to R2DBC
Introduction to R2DBCRob Hedgpeth
 
Benchmarking Cloud-based Tagging Services
Benchmarking Cloud-based Tagging ServicesBenchmarking Cloud-based Tagging Services
Benchmarking Cloud-based Tagging ServicesTanu Malik
 
Thomas Krichel (Long Island University) – AuthorClaim
Thomas Krichel (Long Island University) – AuthorClaimThomas Krichel (Long Island University) – AuthorClaim
Thomas Krichel (Long Island University) – AuthorClaimRepository Fringe
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
 
JOnConf - A CDC use-case: designing an Evergreen Cache
JOnConf - A CDC use-case: designing an Evergreen CacheJOnConf - A CDC use-case: designing an Evergreen Cache
JOnConf - A CDC use-case: designing an Evergreen CacheNicolas Fränkel
 
Kettleetltool 090522005630-phpapp01
Kettleetltool 090522005630-phpapp01Kettleetltool 090522005630-phpapp01
Kettleetltool 090522005630-phpapp01jade_22
 
Accelerating Discovery via Science Services
Accelerating Discovery via Science ServicesAccelerating Discovery via Science Services
Accelerating Discovery via Science ServicesIan Foster
 
Presentation of Gantt Chart (System Analysis and Design)
Presentation of Gantt Chart (System Analysis and Design)Presentation of Gantt Chart (System Analysis and Design)
Presentation of Gantt Chart (System Analysis and Design)Mark Ivan Ligason
 
A Rules-Based Service for Suggesting Visualizations to Analyze Earth Science ...
A Rules-Based Service for Suggesting Visualizations to Analyze Earth Science ...A Rules-Based Service for Suggesting Visualizations to Analyze Earth Science ...
A Rules-Based Service for Suggesting Visualizations to Analyze Earth Science ...Anirudh Prabhu
 
Use of Open Data in Hong Kong
Use of Open Data in Hong KongUse of Open Data in Hong Kong
Use of Open Data in Hong KongSammy Fung
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsGeoffrey Fox
 
Big data at experimental facilities
Big data at experimental facilitiesBig data at experimental facilities
Big data at experimental facilitiesIan Foster
 

La actualidad más candente (20)

SnowCamp - Adding search to a legacy application
SnowCamp - Adding search to a legacy applicationSnowCamp - Adding search to a legacy application
SnowCamp - Adding search to a legacy application
 
Making data typing efforts or automatically detecting data types for automat...
Making data typing efforts or automatically detecting data types  for automat...Making data typing efforts or automatically detecting data types  for automat...
Making data typing efforts or automatically detecting data types for automat...
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01
 
“Open Data Web” – A Linked Open Data Repository Built with CKAN
“Open Data Web” – A Linked Open Data Repository Built with CKAN“Open Data Web” – A Linked Open Data Repository Built with CKAN
“Open Data Web” – A Linked Open Data Repository Built with CKAN
 
Revamp the tablespace reorg process with ibm db2 automation tool
Revamp the tablespace reorg process with ibm db2 automation toolRevamp the tablespace reorg process with ibm db2 automation tool
Revamp the tablespace reorg process with ibm db2 automation tool
 
Redirected Recovery of Recovery Expert for DB2 on z/OS
Redirected Recovery of Recovery Expert for DB2 on z/OSRedirected Recovery of Recovery Expert for DB2 on z/OS
Redirected Recovery of Recovery Expert for DB2 on z/OS
 
balloon: LOD forecasting - cloudy with a chance of services
balloon: LOD forecasting - cloudy with a chance of servicesballoon: LOD forecasting - cloudy with a chance of services
balloon: LOD forecasting - cloudy with a chance of services
 
balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference Information
balloon Fusion: SPARQL Rewriting Based on  Unified Co-Reference Informationballoon Fusion: SPARQL Rewriting Based on  Unified Co-Reference Information
balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference Information
 
Introduction to R2DBC
Introduction to R2DBCIntroduction to R2DBC
Introduction to R2DBC
 
Benchmarking Cloud-based Tagging Services
Benchmarking Cloud-based Tagging ServicesBenchmarking Cloud-based Tagging Services
Benchmarking Cloud-based Tagging Services
 
Thomas Krichel (Long Island University) – AuthorClaim
Thomas Krichel (Long Island University) – AuthorClaimThomas Krichel (Long Island University) – AuthorClaim
Thomas Krichel (Long Island University) – AuthorClaim
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
JOnConf - A CDC use-case: designing an Evergreen Cache
JOnConf - A CDC use-case: designing an Evergreen CacheJOnConf - A CDC use-case: designing an Evergreen Cache
JOnConf - A CDC use-case: designing an Evergreen Cache
 
Kettleetltool 090522005630-phpapp01
Kettleetltool 090522005630-phpapp01Kettleetltool 090522005630-phpapp01
Kettleetltool 090522005630-phpapp01
 
Accelerating Discovery via Science Services
Accelerating Discovery via Science ServicesAccelerating Discovery via Science Services
Accelerating Discovery via Science Services
 
Presentation of Gantt Chart (System Analysis and Design)
Presentation of Gantt Chart (System Analysis and Design)Presentation of Gantt Chart (System Analysis and Design)
Presentation of Gantt Chart (System Analysis and Design)
 
A Rules-Based Service for Suggesting Visualizations to Analyze Earth Science ...
A Rules-Based Service for Suggesting Visualizations to Analyze Earth Science ...A Rules-Based Service for Suggesting Visualizations to Analyze Earth Science ...
A Rules-Based Service for Suggesting Visualizations to Analyze Earth Science ...
 
Use of Open Data in Hong Kong
Use of Open Data in Hong KongUse of Open Data in Hong Kong
Use of Open Data in Hong Kong
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other things
 
Big data at experimental facilities
Big data at experimental facilitiesBig data at experimental facilities
Big data at experimental facilities
 

Destacado

SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...
SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...
SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...Gilles Fedak
 
Active Data PDSW'13
Active Data PDSW'13Active Data PDSW'13
Active Data PDSW'13Gilles Fedak
 
Big Data, Beyond the Data Center
Big Data, Beyond the Data CenterBig Data, Beyond the Data Center
Big Data, Beyond the Data CenterGilles Fedak
 
Mapreduce Runtime Environments: Design, Performance, Optimizations
Mapreduce Runtime Environments: Design, Performance, OptimizationsMapreduce Runtime Environments: Design, Performance, Optimizations
Mapreduce Runtime Environments: Design, Performance, OptimizationsGilles Fedak
 
The iEx.ec Distributed Cloud: Latest Developments and Perspectives
The iEx.ec Distributed Cloud: Latest Developments and PerspectivesThe iEx.ec Distributed Cloud: Latest Developments and Perspectives
The iEx.ec Distributed Cloud: Latest Developments and PerspectivesGilles Fedak
 
iExec: Blockchain-based Fully Distributed Cloud Computing
iExec: Blockchain-based Fully Distributed Cloud ComputingiExec: Blockchain-based Fully Distributed Cloud Computing
iExec: Blockchain-based Fully Distributed Cloud ComputingGilles Fedak
 
How Blockchain and Smart Buildings can Reshape the Internet
How Blockchain and Smart Buildings can Reshape the InternetHow Blockchain and Smart Buildings can Reshape the Internet
How Blockchain and Smart Buildings can Reshape the InternetGilles Fedak
 
Information Management Life Cycle
Information Management Life CycleInformation Management Life Cycle
Information Management Life CycleCollabor8now Ltd
 

Destacado (8)

SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...
SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...
SpeQuloS: A QoS Service for BoT Applications Using Best Effort Distributed Co...
 
Active Data PDSW'13
Active Data PDSW'13Active Data PDSW'13
Active Data PDSW'13
 
Big Data, Beyond the Data Center
Big Data, Beyond the Data CenterBig Data, Beyond the Data Center
Big Data, Beyond the Data Center
 
Mapreduce Runtime Environments: Design, Performance, Optimizations
Mapreduce Runtime Environments: Design, Performance, OptimizationsMapreduce Runtime Environments: Design, Performance, Optimizations
Mapreduce Runtime Environments: Design, Performance, Optimizations
 
The iEx.ec Distributed Cloud: Latest Developments and Perspectives
The iEx.ec Distributed Cloud: Latest Developments and PerspectivesThe iEx.ec Distributed Cloud: Latest Developments and Perspectives
The iEx.ec Distributed Cloud: Latest Developments and Perspectives
 
iExec: Blockchain-based Fully Distributed Cloud Computing
iExec: Blockchain-based Fully Distributed Cloud ComputingiExec: Blockchain-based Fully Distributed Cloud Computing
iExec: Blockchain-based Fully Distributed Cloud Computing
 
How Blockchain and Smart Buildings can Reshape the Internet
How Blockchain and Smart Buildings can Reshape the InternetHow Blockchain and Smart Buildings can Reshape the Internet
How Blockchain and Smart Buildings can Reshape the Internet
 
Information Management Life Cycle
Information Management Life CycleInformation Management Life Cycle
Information Management Life Cycle
 

Similar a Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

60141457-Oracle-Golden-Gate-Presentation.ppt
60141457-Oracle-Golden-Gate-Presentation.ppt60141457-Oracle-Golden-Gate-Presentation.ppt
60141457-Oracle-Golden-Gate-Presentation.pptpadalamail
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataRobert Grossman
 
Presentation for use
Presentation   for usePresentation   for use
Presentation for usebolu804
 
Preservation Metadata, CARLI Metadata Matters series, December 2010
Preservation Metadata, CARLI Metadata Matters series, December 2010Preservation Metadata, CARLI Metadata Matters series, December 2010
Preservation Metadata, CARLI Metadata Matters series, December 2010Claire Stewart
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
 
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSABuilding the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSADatabricks
 
Presentation
PresentationPresentation
Presentationbolu804
 
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...Raffaele Montella
 
PyModESt: A Python Framework for Staging of Geo-referenced Data on the Coll...
PyModESt: A Python Framework for Staging of Geo-referenced Data on the Coll...PyModESt: A Python Framework for Staging of Geo-referenced Data on the Coll...
PyModESt: A Python Framework for Staging of Geo-referenced Data on the Coll...Andreas Schreiber
 
SplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding OverviewSplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding OverviewSplunk
 
Presentation for slideshare
Presentation   for slidesharePresentation   for slideshare
Presentation for slidesharebolu804
 
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding OverviewSplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding OverviewSplunk
 
Reintroducing the Stream Processor: A universal tool for continuous data anal...
Reintroducing the Stream Processor: A universal tool for continuous data anal...Reintroducing the Stream Processor: A universal tool for continuous data anal...
Reintroducing the Stream Processor: A universal tool for continuous data anal...Paris Carbone
 
20090701 Climate Data Staging
20090701 Climate Data Staging20090701 Climate Data Staging
20090701 Climate Data StagingHenning Bergmeyer
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer OverlordsIan Foster
 
GeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL toolGeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL toolThierry Badard
 
IOUG Data Integration SIG w/ Oracle GoldenGate Solutions and Configuration
IOUG Data Integration SIG w/ Oracle GoldenGate Solutions and ConfigurationIOUG Data Integration SIG w/ Oracle GoldenGate Solutions and Configuration
IOUG Data Integration SIG w/ Oracle GoldenGate Solutions and ConfigurationBobby Curtis
 

Similar a Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures (20)

60141457-Oracle-Golden-Gate-Presentation.ppt
60141457-Oracle-Golden-Gate-Presentation.ppt60141457-Oracle-Golden-Gate-Presentation.ppt
60141457-Oracle-Golden-Gate-Presentation.ppt
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
 
Presentation for use
Presentation   for usePresentation   for use
Presentation for use
 
Preservation Metadata, CARLI Metadata Matters series, December 2010
Preservation Metadata, CARLI Metadata Matters series, December 2010Preservation Metadata, CARLI Metadata Matters series, December 2010
Preservation Metadata, CARLI Metadata Matters series, December 2010
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
Monitoring in 2017 - TIAD Camp Docker
Monitoring in 2017 - TIAD Camp DockerMonitoring in 2017 - TIAD Camp Docker
Monitoring in 2017 - TIAD Camp Docker
 
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSABuilding the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
Building the Foundations of an Intelligent, Event-Driven Data Platform at EFSA
 
Presentation
PresentationPresentation
Presentation
 
Microsoft Dryad
Microsoft DryadMicrosoft Dryad
Microsoft Dryad
 
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
How to expand the Galaxy from genes to Earth in six simple steps (and live sm...
 
PyModESt: A Python Framework for Staging of Geo-referenced Data on the Coll...
PyModESt: A Python Framework for Staging of Geo-referenced Data on the Coll...PyModESt: A Python Framework for Staging of Geo-referenced Data on the Coll...
PyModESt: A Python Framework for Staging of Geo-referenced Data on the Coll...
 
SplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding OverviewSplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding Overview
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
 
Presentation for slideshare
Presentation   for slidesharePresentation   for slideshare
Presentation for slideshare
 
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding OverviewSplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
 
Reintroducing the Stream Processor: A universal tool for continuous data anal...
Reintroducing the Stream Processor: A universal tool for continuous data anal...Reintroducing the Stream Processor: A universal tool for continuous data anal...
Reintroducing the Stream Processor: A universal tool for continuous data anal...
 
20090701 Climate Data Staging
20090701 Climate Data Staging20090701 Climate Data Staging
20090701 Climate Data Staging
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer Overlords
 
GeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL toolGeoKettle: A powerful open source spatial ETL tool
GeoKettle: A powerful open source spatial ETL tool
 
IOUG Data Integration SIG w/ Oracle GoldenGate Solutions and Configuration
IOUG Data Integration SIG w/ Oracle GoldenGate Solutions and ConfigurationIOUG Data Integration SIG w/ Oracle GoldenGate Solutions and Configuration
IOUG Data Integration SIG w/ Oracle GoldenGate Solutions and Configuration
 

Último

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Último (20)

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Active Data: Managing Data-Life Cycle on Heterogeneous Systems and Infrastructures

  • 1. Active Data: Data Life Cycle Management Across Heterogeneous Systems and Infrastructures Anthony Simonet, Gilles Fedak (INRIA) Matei Ripeanu, Samer Al-Kiswany (UCB) Kyle Chard, Ian Foster (ANL/UC) Hot Topics in High-Performance Distributed Computing Workshop IBM Almaden Research Center San Jose, California March 12, 2015 1/13 G. Fedak() Active Data March 12, 2015
  • 2. Big Data ... Huge and growing volume of information originating from multiple sources. !"#$%&'("$#) *+'&,-./"#)0+1)*2+("2() 34(")5-$-)!"$(%"($) . . . or Big Bottlenecks ? how to scale the infrastructure ? end-to-end performance improvement, inter-system optimization. how to improve productivity of data-intensive scientist ? data-oriented programming language, data quality, improve automation and errors recovery 2/13 G. Fedak() Active Data March 12, 2015
  • 3. Data Life Cycle Definition Data Life Cycle (DLC) is the course of operational stages through which data pass from the time when they enter a set of systems to the time when they leave it. !"#$%&%'()* +,-.,("-&&%)/* 01(,2/-* !)234&%&* !)234&%&* Challenges : Expose high level view DLC across distributed systems and infrastructures Expose interactions between the infrastructure and the DLC (e.g failures) 3/13 G. Fedak() Active Data March 12, 2015
  • 4. Active Data Active Data: Allow to reason about data sets handled by heterogeneous software and infrastructures. A formal model that captures the essential life cycle stages and properties: creation, deletion, faults, replication, error checking . . . programming model to develop easily data life cycle management applications. Allows legacy systems to expose their intrinsic data life cycle. 4/13 G. Fedak() Active Data March 12, 2015
  • 5. Active Data: Principles & Features System programmers expose their system’s internal data life cycle with a model based on Petri Nets. A Life Cycle Model is made of Places: data states Transitions : data operations • Created t1 Written t2 Read t3 t4 Terminated Each token has a unique identifier, corresponding to the actual data item’s. 5/13 G. Fedak() Active Data March 12, 2015
  • 6. Active Data: Principles & Features System programmers expose their system’s internal data life cycle with a model based on Petri Nets. A Life Cycle Model is made of Places: data states Transitions : data operations Created t1 • Written t2 Read t3 t4 Terminated A transition is fired whenever a data state changes. 5/13 G. Fedak() Active Data March 12, 2015
  • 7. Active Data: Principles & Features System programmers expose their system’s internal data life cycle with a model based on Petri Nets. A Life Cycle Model is made of Places: data states Transitions : data operations Created t1 • Written t2 Read t3 t4 Terminated public void handler () { computeMD5 (); } Code may be plugged by clients to transitions. It is executed whenever the transition is fired. 5/13 G. Fedak() Active Data March 12, 2015
  • 8. Active Data Framework 6/13 G. Fedak() Active Data March 12, 2015 Life Cycle View File transferFile Dataset Metadata Guard Code Execution }Tagged Tokens Notification Framework features: Captures data events in legacy systems High-level life cycle-centered view of data Single namespace for all the files, datasets and metadata Powerful filters based on Data Tags Install Taggers on Transitions Guarded Transitions : only executes on token which have specific tags. Publish/subscribe transitions Custom user reaction to data progress Custom code execution Custom notifications (twitter, email, gdoc, ifttt . . . )
  • 9. Use Case: Advanced Photon Source Globus Catalog Globus Detector Local Storage Compute Cluster 1. Local Transfer 2. Extract Metadata 3. Globus Transfer 4. Swift Parallel Analysis 3 to 5 TB of data per week on this detector Raw data are pre-processed and registered in the Globus Catalog : Data are curated by several applications Data are shared amongst scientific user 7/13 G. Fedak() Active Data March 12, 2015
  • 10. Data Surveillance Framework 4 goals (that would otherwise require a lot of scripting and hacking): Monitoring Data Set Progress Better Automation Sharing & Notification Error Discovery & Recovery 8/13 G. Fedak() Active Data March 12, 2015
  • 11. APS Data Life Cycle Model Created Start transfer Terminated End Detector Created SuccessFailure SucceededFailed EndEnd Terminated End transfer Globus transfer Created End Terminated Start transfer Shared storage Created SuccessFailure SucceededFailed EndEnd Terminated Start Swift Globus transfer CreatedExtract Update TerminatedRemove Globus Catalog Created Initialize Set End Failure Terminated Derive Swift Data life cycle model composed of 6 systems. 9/13 G. Fedak() Active Data March 12, 2015
  • 12. Error Detection & Recovery 10/13 G. Fedak() Active Data March 12, 2015 Example scenario Recover from system-wide errors: faulty acquired files are detected only after Swift fails to process them. In this situation, the user manually: Drops the whole dataset Removes any associated file and metadata Re-acquire the dataset using the same parameters
  • 13. E.D.&R. implementationAvalon Daniel Arnaud Anthony Vincent Use-case: APS data life cycle model Created Start transfer Terminated End Detector Created SuccessFailure SucceededFailed EndEnd Terminated End transfer Globus transfer Created End Terminated Start transfer Shared storage Created SuccessFailure SucceededFailed EndEnd Terminated Start Swift Globus transfer CreatedExtract Update TerminatedRemove Globus Catalog Created Initialize Set End Failure Terminated Derive Swift Data life cycle model composed of 6 systems. Avalon June 3rd, 2014 22/30 Active Data Client Failure and Recovery Handler remove metadata from the catalog Filters data likely to fail Tagger Active Data Client Guard Handler run the Globus catalog UI scripts 11/13 G. Fedak() Active Data March 12, 2015
  • 14. Handler Code TransitionHandler handler = new TransitionHandler () { public void handler(Transition t, boolean isLocal , Token [] inTokens , Token [] outTokens) { // Get the dataset identifier LifeCycle lc = ad. getLifeCycle (inTokens [0]); datasetId = lc.getTokens("Shared storage.Created")[0]. getUid (); // Remove the dataset annotations from the catalog String url = "https :// catalog.globus.org/dataset/" + datasetId; Runtime r = Runtime.getRuntime (); Process p = r.exec(" catalog_client .py remove " + url); p.waitFor (); // Locally , remove the datasets String path = "~/aps/" + datasetId; FileUtils. deleteDirectory (new File(path)); // Publish the " Detector .End" Token root = lc.getTokens("Detector.Created")[0]; ad. publishTransition ("Detector.End", lc); // Notify the user sendEmail("user@server.com", "APS - Corrupted dataset " + datasetId); } }; HandlerGuard guard = new HandlerGuard () { public boolean accept ( Transition t , Token [] inTokens , Token [] outTokens ) { return input [0]. hasTag(" f a i l u r e c o r r u p t e d "); }} ad.subscribeTo("Swift.Failure", handler , guard); 12/13 G. Fedak() Active Data March 12, 2015
  • 15. Conclusion Active Data allows to expose Data Life Cycle across heterogeneous systems and infrastructures transition-based programming model for DLC management application Monitoring, automation, error detection & recovery X-systems optimizations: incremental computing, data staging, caching, throttling etc. . . Perspectives : Use AD to deploy data management software stack on IaaS (Asma Ben Cheick, Heithem Abbes, Univ. Tunis) Big Data Apache stack X-optimization (H. He, CAS, Beijing) Volunteer & crowd computing (M. Moca, BBU, Romania) 13/13 G. Fedak() Active Data March 12, 2015