The Materials Project: Experiences from running a million computational science simulations and sharing the results with tens of thousands of researchers
1. The Materials Project: Experiences from running a
million computational materials science
simulations and sharing the results with tens of
thousands of researchers
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Lab
Berkeley, CA
MolSSI workflow workshop
Slides (already) posted to: http://www.slideshare.net/anubhavster
Input file flags
SLURM format
how to fix ZPOTRF?
• set up the structure coordinates
• write input files, double-check all the flags
• copy to supercomputer
• submit job to queue
• deal with supercomputer headaches
• monitor job
• fix error jobs, resubmit to queue, wait again
• repeat process for subsequent calculations in workflow
• parse output files to obtain results
• copy and organize results, e.g., into Excel
3. Materials development is a key bottleneck
for new technologies
Si for solar cells
since 1950s
graphite + Li{Co,Mn,Ni}O2
for batteries since 1990
Technologies are often limited by the properties of their
component materials, yet new materials take decades to discover
and about 20 years to commercialize
How can we find new materials more quickly & reliably?
4. Today, one can calculate many materials properties
from scratch with density functional theory (DFT)
A. Jain, Y. Shin, and K. A.
Persson, Nat. Rev. Mater.
1, 15004 (2016).
5. High-throughput DFT uses supercomputers to calculate
the properties of tens of thousands of materials
Automate the DFT procedure:
FireWorks: software for programming general computational workflows that can be scaled across large supercomputers.
Supercomputing power:
NERSC: supercomputing center whose processor count is equivalent to ~100,000 desktop machines. Other centers are also viable.
High-throughput
materials screening
G. Ceder & K.A.
Persson, Scientific
American (2015)
6. What we did
• We started with known databases of chemical
compositions, for which the crystal structure was
known but the properties of the material were
unknown
• We ran density functional theory simulations to predict
the properties of those materials (~65,000 compounds)
• We put the results online on a site called “The Materials
Project”
• We built APIs to the data and released our software
stack for generating new data
7. Materials Project database
• Online resource of density functional
theory simulation data for ~65,000
inorganic materials
• Over 35,000 registered users
– we also published a review paper
showing how people used the database
to solve real research problems
• Includes band structures, elastic
tensors, piezoelectric tensors, battery
properties and more
• RESTful API
• www.materialsproject.org – (free)
Jain et al. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).
Jain, A., Persson, K. A. & Ceder, G. Research Update: The materials genome initiative: Data sharing and the impact of collaborative ab initio databases. APL Mater. 4, 053102 (2016).
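The RESTful API mentioned on this slide can be pictured with a minimal sketch. Everything specific here is an assumption for illustration: the base URL, the path layout, the `elasticity` property name, and the `X-API-KEY` header are placeholders rather than the documented Materials Project interface. The request is built but never sent.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumption: placeholder base URL and path layout, not a documented endpoint.
BASE_URL = "https://example.org/rest/materials"


def build_property_request(formula: str, prop: str, api_key: str) -> Request:
    """Build (but do not send) a GET request for one property of one material."""
    url = f"{BASE_URL}/{quote(formula)}/{quote(prop)}"
    # Assumption: the API key travels in a request header named X-API-KEY.
    return Request(url, headers={"X-API-KEY": api_key}, method="GET")


req = build_property_request("GaAs", "elasticity", api_key="YOUR_KEY")
print(req.get_method(), req.full_url)
```

The point of a REST interface is that one resource (a material and a property) maps to one URL, so the same query works from a browser, a script, or another database.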
8. Many “largest ever” data sets – efforts combined are
>1 million DFT simulations!
>4500 elastic tensors
M. de Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, M. Sluiter, C. K. Ande, S. van der Zwaag, J. J. Plata, C. Toher, S. Curtarolo, G. Ceder, K. A. Persson, and M. Asta, Sci. Data, 2015, 2, 150009.
>900 piezoelectric tensors
M. de Jong, W. Chen, H. Geerlings, M. Asta, and K. A. Persson, Sci. Data, 2015, 2, 150053.
>48000 electronic transport
Ricci, Chen, Aydemir, Snyder, Rignanese, Jain, & Hautier, Sci. Data, 2017, 4, 170085.
>150 Wulff shapes + surface characterizations
R. Tran, Z. Xu, B. Radhakrishnan, D. Winston, W. Sun, K. A. Persson, and S. P. Ong, Sci. Data, 2016, 3, 160080.
10. The web site is the tip of the iceberg – we’ve built and
released an entire software stack underlying the effort
pymatgen
FireWorks
custodian
atomate
REST API
11. A “black-box” view of performing a calculation
researcher: What is the GGA-PBE elastic tensor of GaAs?
“something”
Results!
12. Unfortunately, the inside of the “black box”
is usually tedious and “low-level”
researcher: What is the GGA-PBE elastic tensor of GaAs?
lots of tedious, low-level work…
Results!
Input file flags
SLURM format
how to fix ZPOTRF?
• set up the structure coordinates
• write input files, double-check all the flags
• copy to supercomputer
• submit job to queue
• deal with supercomputer headaches
• monitor job
• fix error jobs, resubmit to queue, wait again
• repeat process for subsequent calculations in workflow
• parse output files to obtain results
• copy and organize results, e.g., into Excel
14. What would be a better way?
researcher: What is the GGA-PBE elastic tensor of GaAs?
a button!
Results!
15. We built software for automatically doing calculations
Base packages:
• pymatgen (materials analysis framework)
• FireWorks (workflow definition & execution)
• custodian (calculation error recovery)
Derived packages:
• atomate (automatic materials science workflows)
These are all open-source.
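As a rough picture of what "calculation error recovery" means, the sketch below runs a job, scans its log for a known error signature, patches the inputs, and retries. It is not custodian's API: `FIXES`, `run_with_recovery`, `fake_job`, and the `placeholder_flag` setting are all hypothetical, and the "fix" is a stand-in rather than a recommended remedy for a real ZPOTRF failure.

```python
from typing import Callable, Dict

# Hypothetical table: error signature in the log -> change to the inputs.
# The fix is a placeholder, not a recommended remedy for the real error.
FIXES: Dict[str, Dict[str, object]] = {
    "ZPOTRF": {"placeholder_flag": "safer_value"},
}


def run_with_recovery(run_job: Callable[[dict], str], params: dict,
                      max_attempts: int = 3) -> dict:
    """Run a job, scan its log for known errors, patch the inputs, and retry."""
    params = dict(params)  # work on a copy so the caller's dict is untouched
    for attempt in range(1, max_attempts + 1):
        log = run_job(params)
        matched = [name for name in FIXES if name in log]
        if not matched:
            return {"status": "ok", "attempts": attempt, "params": params}
        for name in matched:
            params.update(FIXES[name])
    return {"status": "failed", "attempts": max_attempts, "params": params}


def fake_job(params: dict) -> str:
    """Stand-in for a real calculation: fails until the placeholder flag is set."""
    if params.get("placeholder_flag") == "safer_value":
        return "calculation finished"
    return "LAPACK: Routine ZPOTRF failed!"


result = run_with_recovery(fake_job, {"encut": 520})
print(result)
```

Encoding each known failure as a rule means the fix is written once and then applied automatically to every later job that hits the same error.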
17. MPComplete on Materials Project works as a simple
“one-click DFT”
www.materialsproject.org “Crystal Toolkit”: anyone can find, edit, and submit (suggest) structures
Custom material
Submit!
• Input generation (parameter choice)
• Workflow mapping
• Supercomputer submission / monitoring
• Error handling
• File transfer
• File parsing / DB insertion
Currently, this feature is available for:
• structure optimization
• band structures
• elastic tensors
• about 10 more in Python interface
One can also use the same infrastructure to conduct customized research studies via a Python interface that provides access to high-level operations
18. Workflow parameters can be customized at
multiple levels of detail
1. Workflows have various high-level options
2. Fireworks also have options / flags (not shown)
3. Firetasks have the most detailed options / flags (not shown)
Example 1: “VASP input set” controls the rules that set DFT parameters (pseudopotentials, cutoffs, grid densities, etc.) via pymatgen
Example 2: If “stability_check” is enabled, the later parts of the workflow are skipped if the structure is determined to be unstable, to save computer time on uninteresting structures
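The two levels of customization described on this slide can be sketched in plain Python. This is not the atomate or FireWorks API: `run_workflow`, the step names, the `tolerance` option, the `max_e_above_hull` threshold, and the `mock_e_above_hull` input are hypothetical names chosen for the illustration. Only the `stability_check` idea comes from the slide.

```python
from typing import Dict, Optional


def relax_task(state: dict, options: dict) -> None:
    # Task-level option (hypothetical name): a convergence tolerance.
    state["relax_tolerance"] = options.get("tolerance", 1e-4)
    # Stand-in for a computed value; a real run would derive it from DFT output.
    state["e_above_hull"] = state["structure"]["mock_e_above_hull"]


def elastic_task(state: dict, options: dict) -> None:
    state["elastic_tensor"] = "computed (placeholder)"


def run_workflow(structure: dict, stability_check: bool = True,
                 max_e_above_hull: float = 0.1,
                 task_options: Optional[Dict[str, dict]] = None) -> dict:
    """Run steps in order; optionally skip later steps for unstable structures."""
    task_options = task_options or {}
    steps = [("relax", relax_task), ("elastic", elastic_task)]
    state = {"structure": structure, "completed": [], "skipped": []}
    for index, (name, task) in enumerate(steps):
        task(state, task_options.get(name, {}))
        state["completed"].append(name)
        unstable = state.get("e_above_hull", 0.0) > max_e_above_hull
        # Workflow-level option: stop early to save computer time.
        if stability_check and name == "relax" and unstable:
            state["skipped"] = [later for later, _ in steps[index + 1:]]
            break
    return state


stable = run_workflow({"mock_e_above_hull": 0.0})
unstable = run_workflow({"mock_e_above_hull": 0.5})
forced = run_workflow({"mock_e_above_hull": 0.5}, stability_check=False)
custom = run_workflow({"mock_e_above_hull": 0.0},
                      task_options={"relax": {"tolerance": 1e-6}})
print(stable["completed"], unstable["skipped"], forced["completed"])
```

Most users only touch the workflow-level switch, while the per-task options remain available for the cases that need them.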
19. You can build workflows from scratch or reuse
components to assemble workflows
Multiple workflows are built with the same components
stacked together in different ways, like Legos
These two workflows share almost all of their code
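The "Lego" idea can be illustrated with a small sketch in which two workflows are assembled from the same building blocks. The step names and the `make_step` / `make_workflow` helpers are hypothetical and only record which steps ran. They do not reproduce how atomate composes its workflows.

```python
from typing import Callable

Step = Callable[[list], list]


def make_step(name: str) -> Step:
    """Create a stand-in step that just records its name in the run history."""
    def step(history: list) -> list:
        return history + [name]
    return step


# Shared building blocks (hypothetical names).
optimize = make_step("structure optimization")
static = make_step("static calculation")
band_structure = make_step("band structure")
elastic = make_step("elastic deformations")


def make_workflow(*steps: Step) -> Step:
    """Stack steps into a workflow that runs them in order."""
    def run(history: list) -> list:
        for step in steps:
            history = step(history)
        return history
    return run


# Two workflows that differ only in their final block.
band_structure_wf = make_workflow(optimize, static, band_structure)
elastic_wf = make_workflow(optimize, static, elastic)

bs_history = band_structure_wf([])
el_history = elastic_wf([])
print(bs_history)
print(el_history)
```

Because the first two blocks are shared, an improvement to either one benefits every workflow that stacks it.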
20. Software allows you to leverage the prior efforts and
knowledge of many researchers past + present
All past and present knowledge, from everyone in the group,
everyone previously in the group, and our collaborators,
about how to run calculations
K. Mathew, J. Montoya, S. Dwaraknath, A. Faghaninia, M. Aykol, S.P. Ong, B. Bocklund, T. Smidt, H. Tang, I.H. Chu, M. Horton, J. Dagdelen, B. Wood, Z.K. Liu, J. Neaton, K. Persson, A. Jain
22. Things that worked for us (1) - BDFLs
• At first, we tried to make every coding decision by committee –
e.g., get all the developers to sit in a room and agree on a solution
• Later, we assigned a strong BDFL (benevolent dictator for life)
to each codebase, who would consider all options but could
simply make decisions on behalf of that codebase
• We found that, even though the BDFL was not always right, we
were able to progress much faster, much better, and surprisingly
with much less conflict than under the old committee approach
• Note: If you were BDFL of a codebase, you got to do things your
way. But you were also signing up for a ton of extra work for that
privilege. Thus, BDFLs must care a lot about the code, be very
detail oriented, and be willing to work overtime. Not everyone is a
candidate!
23. Things that worked for us (2) – forced collaboration
• The tendency for most scientists, at least at first, is to
write their own individual scripts in their own corner
• At first, a strong authority figure (i.e., center lead)
was needed to force collaboration.
– “All code must go in pymatgen!” – Kristin Persson
• When the code builds enough momentum and is big /
established enough, forced collaboration can be
dropped and researchers naturally put code there.
24. Things that worked for us (3) - MongoDB
• When most people think databases, they think “SQL”
– We were also of that mentality from 2006-2011
• We built a beautiful, intricate schema (database blueprint)
for simulation data that was a wonder to behold
– But, only the “database master” really knew how to modify /
expand it
– Any time a new type of data needed to be included in the
database, the “database master” had to design schema updates
• A computer science colleague thought we might want to
experiment with MongoDB
• Result: we can move so much faster with MongoDB due to
its flexibility and easy learning curve.
– These days, we don’t really use SQL for anything.
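The flexibility described on this slide can be sketched without a database server. Plain Python dicts stand in for MongoDB documents here, and `find` and `has_field` are hypothetical helpers that mimic a simple equality filter and a field-existence check. All values are placeholders, not real simulation data.

```python
from typing import Any, Dict, List

# Plain dicts in a list stand in for documents in a MongoDB collection.
# All values are placeholders, not real Materials Project data.
materials: List[Dict[str, Any]] = [
    {"material_id": "demo-1", "formula": "GaAs", "band_gap": "placeholder"},
]

# A new type of data arrives: store it as an extra nested field.
# No schema update is needed, and older documents are left untouched.
materials.append({
    "material_id": "demo-2",
    "formula": "GaAs",
    "elasticity": {"tensor": "placeholder", "units": "GPa"},
})


def find(docs: List[Dict[str, Any]], query: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return documents whose fields equal every key/value pair in the query."""
    return [doc for doc in docs
            if all(doc.get(key) == value for key, value in query.items())]


def has_field(docs: List[Dict[str, Any]], field_name: str) -> List[Dict[str, Any]]:
    """Return documents that contain the given top-level field."""
    return [doc for doc in docs if field_name in doc]


gaas_docs = find(materials, {"formula": "GaAs"})
with_elasticity = has_field(materials, "elasticity")
print(len(gaas_docs), [doc["material_id"] for doc in with_elasticity])
```

With a fixed relational schema, the second record would have required a new table or column before it could be stored. Here it is simply a document with one more field.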
25. Things that worked for us (4) – day 1 open source
• Early in the project, we felt there was commercial and
“research advantage” value in all our automation software
– “Let’s release open source in the future, when the code is cleaner
and also we finished getting our own research mileage out of it” –
Materials Project, circa 2011
• One BDFL experimented with day 1 open-source for a new
and experimental code that rewrote a major, closed-source
legacy Java code in Python
– That code, pymatgen, grew very quickly and displaced the old
legacy code in record time. It’s been cited ~300 times in just 4
years since publication!
• Today all our codes are open source from day 1
– Incidentally, if a code is not open source from day 1, we almost never
see it become open source. The “clean it up and release as
open source later” approach never works for us.
26. Thank you!
• Prof. Kristin Persson and Prof. Gerbrand Ceder,
founders of Materials Project and their teams
• Prof. Shyue Ping Ong, pymatgen BDFL
• NERSC computing center and staff
• Funding: U.S. Department of Energy
• …and everyone who contributed to these codes!