20160922 Materials Data Facility TMS Webinar

Ben Blaiszik (blaiszik@uchicago.edu),
Kyle Chard, Rachana Ananthakrishnan
Michael Ondrejcek, Kenton McHenry
PIs: Ian Foster (foster@uchicago.edu), Steven Tuecke, John Towns
materialsdatafacility.org
globus.org
Materials Data Facility -
Data Services to Advance Materials
Science Research

2
http://dx.doi.org/10.1007/s11837-016-2001-3
MDF Article in JOM (August Issue)

MaterialsDataFacility.org
3
To get started,
contact Ben Blaiszik
blaiszik@uchicago.edu

4
Outline
APIs
• Overview
§ MDF Overview
§ Globus quick introduction
• MDF Data Publication Service
§ Key MDF data pub service features
§ Publication walk-through
• General Observations and Future
Outlook

What is MDF?
5
We are developing production services
to make it more simple for materials
datasets and resources to be ...
Published
Identified
Described
Curated
Verifiable
Accessible
Preserved
Discovered
Searched
Browsed
Shared
Recommended
Accessed
and
SRD
Publishable Results
Published Results
Resource Data
Ref Data
Derived Data
Working Data * Figure adapted from Warren et al.

Data Service Infrastructure
6
Publication Discovery
Compute for
data interaction
and viz
Resource
Registration
APIs
+ +
+ - Initial Foci

7
Publication
APIs
• Identify datasets with persistent identifiers (e.g. DOI)
• Describe datasets with appropriate metadata and
provenance
• Verify dataset contents over time
• Preserve critical datasets in a state that increases
transparency, replicability, and helps encourage reuse

8
Discovery
• Search and query datasets in modern ways – e.g. via
search against indexed metadata and harvested file
contents rather than remembering opaque file paths
Future...
Spotlight for all
data you have
access to
regardless of
location
Under Development

9
Discovery
Under Development
• SaaS cloud-hosted solution
• Logical metadata repository to index many external sources
• Flexible queries (boosting, full text, partial matches, etc.)
• Search results are limited by ACLs

10
Discovery
Under Development
• All MDF-published datasets will be indexed
• May use common schemas (Datacite, Dublin Core etc.) or
domain specific
• Globus endpoint contents may be indexed (owner enabled)
• Index has the flexibility of no required schema
• Built on Elasticsearch for proven scalability and speed,
hosted on scalable AWS resources

11
Discovery
Under Development
Custom boosting
Facets
Test-indexed data

13
Globus
Background
https://www.globus.org

Globus Platform-as-a-Service (PaaS)
14
Identity
management
User
groups
Data
transfer
Data
sharing
• Share directly from your storage
device (laptop or cluster)
• File and directory-level ACLs
• Manage user group creation and
administration flows
• Share data with user groups
• High-performance data transfer
from a web browser
• Optimize transfer settings and
verify transfer integrity
• Add your laptop to the Globus cloud
with Globus Connect Personal
• create and manage a unique
identity linked to external identities
for authentication
Publication Discovery

REST APIs, Clients, and Docs
15
• New version of core services released in Feb.
• New Python SDK available
§ https://github.com/globusonline/globus-sdk-python
• Jupyter Notebook Examples
§ https://github.com/globus/globus-jupyter-notebooks
• Sample Data Portal
§ https://github.com/globus/globus-sample-data-portal
• (alpha) MDF Data Publication Service API

Globus Background
16
B
Globus moves the
data for you
secure
endpoint,
e.g. laptop
You
submit a
transfer
request Globus
notifies you
once the
transfer is
complete
secure
endpoint,
e.g. midway
transfer
A
Endpoint
• E.g. laptop or server
running a Globus
client (e.g. Dropbox
client)
• Enables advanced file
transfer and sharing
• Currently GridFTP,
future GridFTP + HTTP
Some Key
Features
• REST API for
automation and
interoperability
• Web UI for
convenience
• Optimizes and verifies
transfers
• Handles auto-restarts
• Battle tested with big
data

Globus Web UI
17
Endpoint
• E.g. laptop or server
running a Globus
client (e.g. Dropbox
client)
• Enables advanced file
transfer and sharing
• Currently GridFTP,
future GridFTP + HTTP
Some Key
Features
• REST API for
automation and
interoperability
• Web UI for
convenience
• Optimizes and verifies
transfers
• Handles auto-restarts
• Battle tested with big
data

19
Data
Publication
Where are we
Now?

20
Materials Data Publication/Discovery is Often
a Challenge
Data Collection Data Storage and Process Publication

21
a Challenge
Data Collection
?
?
?
Networked storage, sometimes many TB
Unique identifier data for search/cite
Custom metadata descriptions
Data curation workflow
Automation capabilities
Data Storage and Process Publication
Want to
Discover / Use
Want to
Publish
Don’t put under desk!
Needed to close the loop

22
Data Collection
?
?
?
Need storage, sometimes many TB
Need to uniquely identify data for search/cite
Need custom metadata descriptions
Need a data curation workflow
Need automation capabilities
Data Storage and Process Publication
Want to
Discover / Use
Want to
Publish
a Challenge
Don’t put under desk!

Collection Model
23
• Collections might be a
research group or a research
topic...
• Collections have specified
§ Mapping to storage endpoint
§ Currently handled as automatically created
shared endpoints
§ Metadata schemas
§ Access control policies
§ Licenses
§ Curation workflows
• Collections contain
§ Datasets
§ Data
§ Metadata
• Metadata Persistence
§ Metadata log file with dataset
§ Metadata replicated in search
index

Hybrid Distributed Model
24
Petrel @Argonne
1.7 PB
BlueWaters Condo
@UIUC
100 TB
EP 1
EP 2
EP 3
Campus
RDS
DOE
Cloud Metadata Index
And Tools
Centralized resource
Globus endpoint
NSF
(XSEDE)
ElectroCat
EP

Publish Large Datasets
25
• Distributed data model leverages
Globus production capabilities for file
transfer (i.e. dataset assembly), user
authentication, and access control
groups
• 100s of TB of reliable storage @ NCSA,
and more storage at Argonne
§ Globus endpoint at ncsa#mdf on Nebula
§ Expandable to many PBs as necessary
§ Automated tape backup for reliability (in progress)
• Researchers can optionally use your
own local or institutional storage

Uniquely Identify Datasets
26
• Associate a unique identifier with a
dataset
§ DOI, Handle
• Improve dataset discovery and citability
§ Aligning incentives and understanding the culture
will be critical to driving adoption
DatasetDownloads
Time
• Your work has been cited
153 times in the last year
• Researchers from 30
institutions have
downloaded your datasets
Future...

Share Data with Flexible ACLs
27
• Share data publicly, with a set of users,
or keep data private
Leverage Curation Workflows
• Collection administrators can specify
the level of curation workflow required
for a given collection e.g.
§ No curation
§ Curation of metadata only
§ Curation of metadata and files

Customize Metadata
28
• Build a custom metadata schema for
your specific research data
• Re-use existing metadata schemas
• Working in conjunction with NIST
researchers to define these schemas
• Can we build a system that allows
schema:
§ Inheritance
§ E.g. a schema “polymers” might inherit and expand
upon the “base material” of NIST
§ Versioning
§ E.g. Understand contextually how to map fields
between versions
§ Dependence
§ E.g. Allows the ability to build consensus around
schemas
Future...

Example Use Case
30
Publishing Big, Remote Data
Collected multi TB
of data at a light source
Bundle the data with metadata
and provenance
Want a citable DOI to share the
raw and derived data with the
community
Want their data to be discoverable
by free text search and custom
metadata

MDF Collections
32
Recall: Policies Set at the Collection Level
• Required metadata, schemas
• Data storage location
• Metadata curation policies

MDF Metadata Entry
33
• Scientist or
representative
describes the data
they are submitting
• For this collection
Dublin Core and a
custom metadata
template are
required

MDF Custom Metadata
34
• Scientist or
representative
describes the data
they are submitting
• For this collection
Dublin Core and a
custom metadata
template are
required

Dataset Assembly
35
• Shared endpoint is
auto-created on
collection-specified
data store
• Scientist transfers
dataset files to a
unique publish
endpoint
• Dataset may be
assembled over any
period of time
• When submission is
finished, dataset
will be rendered
immutable via
checksum
(e.g. NU) (e.g. UIUC Nebula)

Dataset Assembly
36
• Shared endpoint is
auto-created on
collection-specified
data store
• Scientist transfers
dataset files to a
unique publish
endpoint
• Dataset may be
assembled over any
period of time
• When submission is
finished, dataset
will be rendered
immutable via
checksum
(e.g. NU) (e.g. UIUC Nebula)

Dataset Curation (Optional)
37
• Optionally specified
in collection
configuration
• Can be approved or
rejected (i.e. sent
back to the
submitter)

Mint a Permanent Identifier
38
Can be DOI or Handle

47
General
Observations
and Future
Outlook

48
Publication Year
1 Milestones
APIs
• Opened to the public in March 2016
• Provisioned reliable storage to support researchers sharing
open materials data (~200 TB)
• MDF data volume approaching ~ 6 TB of materials data
• Started building deep relationships with many of the key
materials data generating groups and communities
• Ingested dataset > 1 TB in size
• Ingested dataset > 1.5M files

Integration with the Community is Key
49
Materials Project
OQMD
Citrination
Materials
Commons
Other Facilities (APS, SNS, NSLS, …), Institutional Repositories,
Publishers!
Metadata
Publishing
MetadataMD,
Pub., Compute
Metadata
Publishing
NCSA-PIREHV/TMSMBDH

Understanding Incentives is Critical
50
Meeting Award
Requirements
Smoothing
Dislocations
Increasing
Impact
• Increase paper citations1
• Add dataset citation capabilities
• [Distance] Enable simple sharing among
collaborators (near and far)
• [Personnel] Ease transitions between students
• [Format] Lessen need for ad hoc resource sharing
(e.g. via group websites)
• Simplify DMP compliance
1 Citation increase 30 (10.7717/peerj.175) - 60% (10.1371/journal.pone.0000308) [caveat bio research]

Lessons Learned
51
• The demand is there from researchers and
institutions
• Lots of cross-over with centers and projects
§ (NIST) CHiMaD
§ (DOE) ElectroCat, MICCoM, JCESR, PRISMS, Argonne IT, Integrated Imaging Institute
§ (NSF) T2C2 [DIBBS], AMI-CFP (PIRE), HV/TMS (I/UCRC), BD Hubs, IMaD BD Spoke*
• Data Heterogeneity is a challenge
§ Metadata is the major sticking point
• Friction points
§ Need more flexible data objects e.g. {“temperature”:100, “unit”:“K”}
§ Need file or directory based metadata
§ Immutable datasets alone is not enough à Versioning
§ Data gathering in retrospect
§ Schema generation and interoperability
§ Working with and following developments at NIST, RDA, Citrine et al.
§ Differing institutional approval processes
§ Lack of programmatic interface (planned).
• Support for data interactivity and visualization
• Smart versioning for large file-based datasets

Wider Data Community
52
• Curated and described datasets
• Well-posed problems
• Community to share analyses
• Challenges to start “sprints”
• Great APIs and clients
• Examples to get started
• Hundreds of video tutorials
Materials ProjectOQMD
Citrination
Materials
Commons
• Less inherently intuitive problems
• Sometimes need advanced compute
capabilities
• Often many TB

53
• Continuous integration, QA, and testing
• Containerized solutions, microservice architecture, abstracting software from
hardware
• Automation
• Internet of Things (IoT) – connect everything
• Machine Learning / AI
• Natural Language Processing (Siri, chatbots or “slack”bots, etc.)
• Search rules the world – ok this was 20 years ago…
What are the analogs and applications in the materials community?
Materials ProjectOQMD
Citrination
Materials
Commons
• Less inherently intuitive problems
• Sometimes need advanced compute
capabilities
• Often many TB
Broader Trends

54
Experimentation Ahead
No team commitments
here!
Open source opportunities,
contact:
blaiszik@uchicago.edu

Use Case: Scenario Generator-Consumer
55
• Data generator
§ Generates data periodically (perhaps from an instrument)
§ Pushes data to a public channel
§ Schema is validated before inclusion in channel stream
• Data consumer
§ Polls channel periodically
§ Wants to pull datasets by property
Dataset
Channel
MDF-composites
Data Generator
Data Consumer
DatasetDatasetDataset
DatasetDatasetcreate
q: result
q

Automated Data Aggregation
(consumer)
56

Aggregate, Perform ML
58
• Combine cloud-published dataset, scikitlearn, pandas to predict
steel fatigue and “reproduce” data from journal publication

Aggregate, Perform ML, Visualize
59
• Combine cloud-published dataset, scikitlearn, pandas to predict
steel fatigue and validate journal publication

What’s Currently Available?
60
• Web interface to support data publication (public-
facing APIs coming soon)
• 100s of TB of storage at NCSA (scalable to many PB)
more at Argonne (1.7 PB total on Petrel – not all for
materials…)
• Help with developing metadata schemas to describe
your research datasets
MDF Tutorial on Github
https://github.com/blaiszik/materials-data-facility-training

What are we looking for?
61
• Early adopters, willing to get their hands
dirty with the service and give honest
feedback
• Key integration points where metadata is
picked up automatically!
• Key datasets and resources of all sizes,
shapes, raw or derived, that might help us
understand the process better

Thanks to Our Sponsors!
62
U . S . D E P A RT M E N T O F
ENERGY

Publication REST APIs Discovery
• Identify datasets with
persistent identifiers (e.g.
DOI)
• Describe datasets with
appropriate metadata and
provenance
• Verify dataset contents over
time
• Handle big (and small) data:
We have already ingested
datasets with > 1.5M files
and > 1TB in size
• Search and query
datasets in modern ways
• Index metadata and
harvest file contents
• Simple user interfaces
(i.e., after Google and
Amazon)
Opened to external users in Mar. 2016
~ 6 TB of data published
Materialsdatafacility
.org

20160922 Materials Data Facility TMS Webinar

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (10)

Similar a 20160922 Materials Data Facility TMS Webinar

Similar a 20160922 Materials Data Facility TMS Webinar (20)

Último

Último (20)

20160922 Materials Data Facility TMS Webinar