The Materials Project: Experiences from running a
million computational materials science
simulations and sharing the results with tens of
thousands of researchers
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Lab
Berkeley, CA
MolSSI workflow workshop
Slides (already) posted to: http://www.slideshare.net/anubhavster
Talk outline
•  What we did
•  How we did it
•  Things that worked for us
Materials development is a key bottleneck
for new technologies
Si for solar cells
since 1950s
graphite + Li{Co,Mn,Ni}O2
for batteries since 1990
Technologies are often limited by the properties of their
component materials, but new materials take decades to discover
and about 20 years to commercialize
How can we find new materials more quickly & reliably?
Today, one can calculate many materials properties
from scratch with density functional theory (DFT)
A. Jain, Y. Shin, and K. A.
Persson, Nat. Rev. Mater.
1, 15004 (2016).
High-throughput DFT uses supercomputers to calculate
the properties of tens of thousands of materials
Automate the DFT
procedure
Supercomputing
Power
FireWorks
Software for programming
general computational
workflows that can be
scaled across large
supercomputers.
NERSC
Supercomputing center;
processor count is equivalent to
~100,000 desktop machines.
Other centers are also viable.
High-throughput
materials screening
G. Ceder & K.A.
Persson, Scientific
American (2015)
What we did
•  We started with known databases of chemical
compositions, for which the crystal structure was
known but the properties of the material were
unknown
•  We ran density functional theory simulations to predict
the properties of those materials (~65,000 compounds)
•  We put the results online on a site called “The Materials
Project”
•  We built APIs to the data and released our software
stack for generating new data
Materials Project database
•  Online resource of density functional
theory simulation data for ~65,000
inorganic materials
•  Over 35,000 registered users
–  we also published a review paper
showing how people used the database
to solve real research problems
•  Includes band structures, elastic
tensors, piezoelectric tensors, battery
properties and more
•  RESTful API
•  www.materialsproject.org – (free)
Jain et al. Commentary: The
Materials Project: A materials
genome approach to accelerating
materials innovation. APL Mater. 1,
011002 (2013).
Jain, A., Persson, K. A. & Ceder, G.
Research Update: The materials
genome initiative: Data sharing and
the impact of collaborative ab initio
databases. APL Mater. 4, 053102 (2016).
Many “largest ever” data sets – efforts combined are
>1 million DFT simulations!
•  >4500 elastic tensors: M. de Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, M. Sluiter, C. K. Ande, S. van der Zwaag, J. J. Plata, C. Toher, S. Curtarolo, G. Ceder, K. A. Persson, and M. Asta, Sci. Data, 2015, 2, 150009.
•  >900 piezoelectric tensors: M. de Jong, W. Chen, H. Geerlings, M. Asta, and K. A. Persson, Sci. Data, 2015, 2, 150053.
•  >48000 electronic transport: Ricci, Chen, Aydemir, Snyder, Rignanese, Jain, & Hautier, Sci. Data, 2017, 4, 170085.
•  >150 Wulff shapes + surface characterizations: R. Tran, Z. Xu, B. Radhakrishnan, D. Winston, W. Sun, K. A. Persson, and S. P. Ong, Sci. Data, 2016, 3, 160080.
Talk outline
•  What we did
•  How we did it
•  Things that worked for us
The web site is the tip of the iceberg – we’ve built and
released an entire software stack underlying the effort
•  pymatgen
•  FireWorks
•  custodian
•  atomate
•  REST API
A “black-box” view of performing a calculation
Researcher: “What is the GGA-PBE elastic tensor of GaAs?” → “something” → Results!
Unfortunately, the inside of the “black box”
is usually tedious and “low-level”
Researcher: “What is the GGA-PBE elastic tensor of GaAs?” → lots of tedious, low-level work… → Results!
Input file flags / SLURM format / how to fix ZPOTRF?
•  set up the structure coordinates
•  write input files, double-check all the flags
•  copy to supercomputer
•  submit job to queue
•  deal with supercomputer headaches
•  monitor job
•  fix error jobs, resubmit to queue, wait again
•  repeat process for subsequent calculations in workflow
•  parse output files to obtain results
•  copy and organize results, e.g., into Excel
What would be a better way?
Researcher: “What is the GGA-PBE elastic tensor of GaAs?” → “something” → Results!
What would be a better way?
Researcher: “What is the GGA-PBE elastic tensor of GaAs?” → a button → Results!
We built software for automatically doing calculations
Base packages:
•  pymatgen (materials analysis framework)
•  FireWorks (workflow definition & execution)
•  Custodian (calculation error recovery)
Derived packages:
•  atomate (automatic materials science workflows)
These are all open-source.
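As a rough illustration of what calculation error recovery involves, the sketch below runs a toy job, matches a failure against a table of known fixes, patches the settings, and tries again. It uses only the Python standard library and is not Custodian's actual API; the names `Job`, `run_once`, `FIXES`, and `run_with_recovery`, as well as the error name and the `algo` setting, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Job:
    """A toy calculation whose settings can be patched between attempts."""
    settings: Dict[str, object]
    log: List[str] = field(default_factory=list)


def run_once(job: Job) -> Optional[str]:
    """Pretend to run the job; return an error name, or None on success."""
    # Hypothetical failure rule: the run fails until "algo" is "robust".
    if job.settings.get("algo") != "robust":
        return "convergence_error"
    return None


# Map each known error name to a function that patches the job settings.
FIXES: Dict[str, Callable[[Job], None]] = {
    "convergence_error": lambda job: job.settings.update(algo="robust"),
}


def run_with_recovery(job: Job, max_attempts: int = 3) -> bool:
    """Run the job; on a known error, apply its fix and try again."""
    for attempt in range(1, max_attempts + 1):
        error = run_once(job)
        if error is None:
            job.log.append(f"attempt {attempt}: success")
            return True
        job.log.append(f"attempt {attempt}: {error}")
        fix = FIXES.get(error)
        if fix is None:
            return False  # unknown error: stop and leave it for a human
        fix(job)
    return False


job = Job(settings={"algo": "fast"})
ok = run_with_recovery(job)
print(ok, job.log)
```

The useful property of this pattern is that each fix, once written down, is applied automatically to every later job that hits the same error.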
MPComplete on Materials Project works as a simple
“one-click DFT”
Automated stages:
•  Input generation (parameter choice)
•  Workflow mapping
•  Supercomputer submission / monitoring
•  Error handling
•  File transfer
•  File parsing / DB insertion
Custom material → Submit!
www.materialsproject.org “Crystal Toolkit”: anyone can find, edit, and submit (suggest) structures
Currently, this feature is available for:
•  structure optimization
•  band structures
•  elastic tensors
•  about 10 more in the Python interface
One can also use the same
infrastructure to conduct
customized research studies via a
Python interface that provides
access to high-level operations
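The stages that the “one-click DFT” automates (input generation, submission, parsing, database insertion) can be pictured as a chain of functions that each take a record and hand it on. The sketch below is a standard-library stand-in, not the actual atomate or FireWorks interface: every function name is hypothetical, the “calculation” is a placeholder formula rather than DFT, and the numbers are illustrative, not physical results.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Stage = Callable[[Record], Record]


def generate_inputs(record: Record) -> Record:
    # Choose calculation parameters from the structure (placeholder rule).
    record["inputs"] = {"n_atoms": len(record["structure"])}
    return record


def run_calculation(record: Record) -> Record:
    # Stand-in for queue submission and monitoring on a supercomputer.
    # The "energy" is a made-up number, not a physical result.
    record["raw_output"] = f"energy = {-1.5 * record['inputs']['n_atoms']}"
    return record


def parse_output(record: Record) -> Record:
    # Extract the number of interest from the raw text output.
    record["energy"] = float(record["raw_output"].split("=")[1])
    return record


def make_db_inserter(db: List[Record]) -> Stage:
    # Build a stage that stores the parsed result in an in-memory "database".
    def insert(record: Record) -> Record:
        db.append({"formula": record["formula"], "energy": record["energy"]})
        return record
    return insert


def run_pipeline(record: Record, stages: List[Stage]) -> Record:
    # Pass the record through each stage in order.
    for stage in stages:
        record = stage(record)
    return record


database: List[Record] = []
stages = [generate_inputs, run_calculation, parse_output,
          make_db_inserter(database)]
run_pipeline({"formula": "GaAs", "structure": ["Ga", "As"]}, stages)
print(database)
```

Because the caller supplies only the material, the same chain can be rerun for thousands of structures without touching the individual stages.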
Workflow parameters can be customized at
multiple levels of detail
1.  Workflows have various high-level options
2.  Fireworks also have options / flags (not shown)
3.  Firetasks have the most detailed set of options / flags (not shown)
Example 1: The “VASP input set” controls the rules that set DFT
parameters (pseudopotentials, cutoffs, grid densities, etc.) via
pymatgen
Example 2: If “stability_check” is enabled, the later parts of the
workflow are skipped if the structure is determined to be unstable,
which saves computer time on uninteresting structures
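The effect of a high-level option such as “stability_check” can be sketched in a few lines. This is a minimal illustration using plain Python lists, not the real workflow code: `build_workflow`, `run_workflow`, and the step names are hypothetical, and the stability verdict is passed in by hand rather than computed.

```python
from typing import List


def build_workflow(stability_check: bool = False) -> List[str]:
    """Assemble the list of steps; the flag inserts an early gate."""
    steps = ["structure optimization"]
    if stability_check:
        steps.append("stability check")
    steps += ["band structure", "elastic tensor"]
    return steps


def run_workflow(steps: List[str], is_stable: bool) -> List[str]:
    """Run steps in order; stop after the gate if the structure is unstable."""
    completed = []
    for step in steps:
        completed.append(step)
        if step == "stability check" and not is_stable:
            break  # skip the expensive later steps
    return completed


gated = run_workflow(build_workflow(stability_check=True), is_stable=False)
ungated = run_workflow(build_workflow(stability_check=False), is_stable=False)
print(gated)
print(ungated)
```

With the flag on, an unstable structure costs only the optimization and the check; with it off, every step runs regardless.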
You can build workflows from scratch or reuse
components to assemble workflows
Multiple workflows are built with the same components
stacked together in different ways like Legos
These two workflows share almost all of their code
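One way to picture the Lego-style reuse is to treat each component as a function and each workflow as an ordered list of components. The sketch below is an assumption-laden toy, not the project's code: the component names and the `make_step` and `run` helpers are hypothetical, and each component only records that it ran.

```python
from typing import Callable, Dict, List

State = Dict[str, List[str]]
Step = Callable[[State], State]


def make_step(name: str) -> Step:
    """Build a placeholder component that records its name when run."""
    def step(state: State) -> State:
        state["history"].append(name)
        return state
    step.__name__ = name
    return step


# Shared building blocks, each defined once.
optimize_structure = make_step("optimize_structure")
static_energy = make_step("static_energy")
band_structure = make_step("band_structure")
elastic_tensor = make_step("elastic_tensor")

# Two workflows assembled from the same pieces in different combinations.
BAND_STRUCTURE_WORKFLOW = [optimize_structure, static_energy, band_structure]
ELASTIC_WORKFLOW = [optimize_structure, static_energy, elastic_tensor]


def run(workflow: List[Step]) -> List[str]:
    """Run each component in order and return the execution history."""
    state: State = {"history": []}
    for step in workflow:
        state = step(state)
    return state["history"]


shared = [s.__name__ for s in BAND_STRUCTURE_WORKFLOW if s in ELASTIC_WORKFLOW]
band_history = run(BAND_STRUCTURE_WORKFLOW)
elastic_history = run(ELASTIC_WORKFLOW)
print(shared)
```

A bug fix or improvement to a shared component then benefits every workflow that includes it.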
Software allows you to leverage the prior efforts and
knowledge of many researchers past + present
All past and present knowledge, from everyone in the group,
everyone previously in the group, and our collaborators,
about how to run calculations:
K. Mathew, J. Montoya, S. Dwaraknath, A. Faghaninia, M. Aykol, S.P. Ong, B. Bocklund, T. Smidt, H. Tang, I.H. Chu, M. Horton, J. Dagdalen, B. Wood, Z.K. Liu, J. Neaton, K. Persson, A. Jain
Talk outline
•  What we did
•  How we did it
•  Things that worked for us
Things that worked for us (1) - BDFLs
•  At first, we tried to make every coding decision by committee –
e.g., get all the developers to sit in a room and agree on a solution
•  Later, we assigned a strong BDFL (benevolent dictator for life)
for each codebase who would consider all options but could
simply make decisions on behalf of that codebase
•  We found that, even though the BDFL was not always right, we
were able to progress much faster, much better, and, surprisingly,
with much less conflict than under the old committee approach
•  Note: If you were BDFL of a codebase, you got to do things your
way. But you were also signing up for a ton of extra work for that
privilege. Thus, BDFLs must care a lot about the code, be very
detail oriented, and be willing to work overtime. Not everyone is a
candidate!
Things that worked for us (2) – forced collaboration
•  The tendency for most scientists, at least at first, is to
write their own individual scripts in their own corner
•  At first, a strong authority figure (i.e., the center lead)
was needed to force collaboration.
–  “All code must go in pymatgen!” – Kristin Persson
•  When the code builds enough momentum and is big /
established enough, forced collaboration can be
dropped and researchers naturally put code there.
Things that worked for us (3) - MongoDB
•  When most people think databases, they think “SQL”
–  We were also of that mentality from 2006-2011
•  We built a beautiful, intricate schema (database blueprint)
for simulation data that was a wonder to behold
–  But, only the “database master” really knew how to modify /
expand it
–  Any time a new type of data needed to be included in the
database, the “database master” had to design schema updates
•  A computer science colleague thought we might want to
experiment with MongoDB
•  Result: we can move so much faster with MongoDB due to
its flexibility and easy learning curve.
–  These days, we don’t really use SQL for anything.
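The schema bottleneck described above can be illustrated with the standard library alone. In the sketch below, `sqlite3` plays the relational side, where storing a new kind of data first requires a schema change, and a plain list of dictionaries stands in for MongoDB-style documents, where a new field simply appears on the records that have it. No MongoDB driver is used, the table and field names are hypothetical, and all numbers are placeholders rather than real materials data.

```python
import sqlite3

# Relational side: the table layout is fixed up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE materials (formula TEXT, energy REAL)")
conn.execute("INSERT INTO materials VALUES (?, ?)", ("GaAs", -1.0))

# A new type of data needs a schema update before it can be stored.
conn.execute("ALTER TABLE materials ADD COLUMN band_gap REAL")
conn.execute("UPDATE materials SET band_gap = ? WHERE formula = ?",
             (0.5, "GaAs"))
sql_columns = [row[1] for row in conn.execute("PRAGMA table_info(materials)")]

# Document side: plain dicts stand in for schema-free documents.
documents = [{"formula": "GaAs", "energy": -1.0}]

# A record with a brand-new nested field needs no migration step.
documents.append({"formula": "Si", "energy": -2.0,
                  "elastic": {"bulk_modulus": 3.0}})
with_elastic = [d["formula"] for d in documents if "elastic" in d]

print(sql_columns)
print(with_elastic)
conn.close()
```

The trade-off is that the document approach moves consistency checks out of the database and into the code that reads and writes the records.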
Things that worked for us (4) – day 1 open source
•  Early in the project, we felt there was commercial and
“research advantage” value in all our automation software
–  “Let’s release open source in the future, when the code is cleaner
and also we finished getting our own research mileage out of it” –
Materials Project, circa 2011
•  One BDFL experimented with day 1 open-source for a new
and experimental code that rewrote a major, closed-source
legacy Java code in Python
–  That code, pymatgen, grew very quickly and displaced the old
legacy code in record time. It’s been cited ~300 times in just 4
years since publication!
•  Today all our codes are open source from day 1
–  Incidentally, if we are not open source from day 1, we almost never
see the code become open source. The “clean it up and release as
open source later” approach never works for us.
Thank you!
•  Prof. Kristin Persson and Prof. Gerbrand Ceder,
founders of Materials Project and their teams
•  Prof. Shyue Ping Ong, pymatgen BDFL
•  NERSC computing center and staff
•  Funding: U.S. Department of Energy
•  …and everyone who contributed to these codes!!
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 

The Materials Project: Experiences from running a million computational materials science simulations and sharing the results with tens of thousands of researchers

  • 1. The Materials Project: Experiences from running a million computational materials science simulations and sharing the results with tens of thousands of researchers. Anubhav Jain, Energy Technologies Area, Lawrence Berkeley National Lab, Berkeley, CA. MolSSI workflow workshop. Slides (already) posted to: http://www.slideshare.net/anubhavster. Title-slide graphic: input file flags; SLURM format; how to fix ZPOTRF? Checklist: set up the structure coordinates; write input files, double-check all the flags; copy to supercomputer; submit job to queue; deal with supercomputer headaches; monitor job; fix error jobs, resubmit to queue, wait again; repeat process for subsequent calculations in workflow; parse output files to obtain results; copy and organize results, e.g., into Excel.
  • 2. Talk outline • What we did • How we did it • Things that worked for us
  • 3. Materials development is a key bottleneck for new technologies. Examples: Si for solar cells since the 1950s; graphite + Li{Co,Mn,Ni}O2 for batteries since 1990. Technologies are often limited by the properties of their component materials, but new materials take decades to discover and about 20 years to commercialize. How can we find new materials more quickly and reliably?
  • 4. Today, one can calculate many materials properties from scratch with density functional theory (DFT). A. Jain, Y. Shin, and K. A. Persson, Nat. Rev. Mater. 1, 15004 (2016).
  • 5. High-throughput DFT uses supercomputers to calculate the properties of tens of thousands of materials. Automating the DFT procedure, combined with supercomputing power, enables high-throughput materials screening. FireWorks: software for programming general computational workflows that can be scaled across large supercomputers. NERSC: supercomputing center with a processor count equivalent to ~100,000 desktop machines; other centers are also viable. G. Ceder & K.A. Persson, Scientific American (2015).
  • 6. What we did • We started with known databases of chemical compositions, for which the crystal structure was known but the properties of the material were unknown • We ran density functional theory simulations to predict the properties of those materials (~65,000 compounds) • We put the results online on a site called “The Materials Project” • We built APIs to the data and released our software stack for generating new data
  • 7. Materials Project database • Online resource of density functional theory simulation data for ~65,000 inorganic materials • Over 35,000 registered users – we also published a review paper showing how people used the database to solve real research problems • Includes band structures, elastic tensors, piezoelectric tensors, battery properties and more • RESTful API • www.materialsproject.org (free). References: Jain et al., Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Mater. 1, 11002 (2013). Jain, A., Persson, K. A. & Ceder, G., Research Update: The materials genome initiative: Data sharing and the impact of collaborative ab initio databases. APL Mater. 4, 53102 (2016).
  • 8. Many “largest ever” data sets – efforts combined are >1 million DFT simulations! • >4500 elastic tensors: M. De Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, M. Sluiter, C. K. Ande, S. Van Der Zwaag, J. J. Plata, C. Toher, S. Curtarolo, G. Ceder, K. A. Persson, and M. Asta, Sci. Data, 2015, 2, 150009. • >900 piezoelectric tensors: M. de Jong, W. Chen, H. Geerlings, M. Asta, and K. A. Persson, Sci. Data, 2015, 2, 150053. • >48000 electronic transport: Ricci, Chen, Aydemir, Snyder, Rignanese, Jain, & Hautier, Sci. Data, 2017, 4, 170085. • >150 Wulff shapes + surface characterizations: R. Tran, Z. Xu, B. Radhakrishnan, D. Winston, W. Sun, K. A. Persson, and S. P. Ong, Sci. Data, 2016, 3, 160080.
  • 9. Talk outline • What we did • How we did it • Things that worked for us
  • 10. The web site is the tip of the iceberg – we’ve built and released an entire software stack underlying the effort: pymatgen, FireWorks, custodian, atomate, REST API
  • 11. A “black-box” view of performing a calculation. Researcher: “What is the GGA-PBE elastic tensor of GaAs?” → “something” → Results!
  • 12. Unfortunately, the inside of the “black box” is usually tedious and “low-level”. Researcher: “What is the GGA-PBE elastic tensor of GaAs?” → lots of tedious, low-level work (input file flags, SLURM format, how to fix ZPOTRF?) → Results! Checklist: set up the structure coordinates; write input files, double-check all the flags; copy to supercomputer; submit job to queue; deal with supercomputer headaches; monitor job; fix error jobs, resubmit to queue, wait again; repeat process for subsequent calculations in workflow; parse output files to obtain results; copy and organize results, e.g., into Excel.
  • 13. What would be a better way? Researcher: “What is the GGA-PBE elastic tensor of GaAs?” → “something” → Results!
  • 14. What would be a better way? Researcher: “What is the GGA-PBE elastic tensor of GaAs?” → a button! → Results!
  • 15. We built software for automatically doing calculations. Base packages: pymatgen (materials analysis framework), FireWorks (workflow definition & execution), Custodian (calculation error recovery). Derived package: atomate (automatic materials science workflows). These are all open-source.
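
The "calculation error recovery" idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Custodian's actual API: `fake_dft_run`, `fix_zpotrf`, `HANDLERS`, and the `step_size` parameter are all hypothetical, and the "fix" is a placeholder rather than a recommended remedy for a ZPOTRF failure.

```python
# Hypothetical sketch of a "run, detect error, fix, rerun" loop.
# None of these names come from a real library.

def fake_dft_run(params):
    """Stand-in for a simulation: returns (ok, log_text)."""
    if params.get("step_size", 0.5) > 0.2:
        return False, "LAPACK: Routine ZPOTRF failed!"
    return True, "converged"


def fix_zpotrf(params):
    """Illustrative fix only: halve a hypothetical step size."""
    fixed = dict(params)
    fixed["step_size"] = params.get("step_size", 0.5) / 2
    return fixed


# Map an error signature found in the log to the function that patches the inputs.
HANDLERS = {"ZPOTRF": fix_zpotrf}


def run_with_recovery(params, max_errors=5):
    """Rerun the job, applying the first matching fix after each failure."""
    history = []
    for _ in range(max_errors + 1):
        ok, log = fake_dft_run(params)
        if ok:
            return params, history
        for pattern, fix in HANDLERS.items():
            if pattern in log:
                params = fix(params)
                history.append(pattern)
                break
        else:
            # No handler recognized the error, so give up instead of looping.
            raise RuntimeError(f"unrecognized error: {log}")
    raise RuntimeError("too many errors")


final_params, fixes = run_with_recovery({"step_size": 0.5})
print(final_params, fixes)
```

Keeping the error signatures and their fixes in one table means that a remedy one researcher discovers can be reused automatically by every later run.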
  • 16. MPComplete on Materials Project works as a simple “one-click DFT”. Pipeline: custom material → Submit! → input generation (parameter choice) → workflow mapping → supercomputer submission / monitoring → error handling → file transfer → file parsing / DB insertion → www.materialsproject.org. With the “Crystal Toolkit”, anyone can find, edit, and submit (suggest) structures. Currently, this feature is available for: • structure optimization • band structures • elastic tensors • about 10 more in the Python interface
  • 17. MPComplete on Materials Project works as a simple “one-click DFT”. Pipeline: custom material → Submit! → input generation (parameter choice) → workflow mapping → supercomputer submission / monitoring → error handling → file transfer → file parsing / DB insertion → www.materialsproject.org. With the “Crystal Toolkit”, anyone can find, edit, and submit (suggest) structures. Currently, this feature is available for: • structure optimization • band structures • elastic tensors • about 10 more in the Python interface. One can also use the same infrastructure to conduct customized research studies via a Python interface that provides access to high-level operations.
  • 18. Workflow parameters can be customized at multiple levels of detail. 1. Workflows have various high-level options. 2. Fireworks also have options / flags (not shown). 3. Firetasks have the most detailed options / flags (not shown). Example 1: the “VASP input set” controls the rules that set DFT parameters (pseudopotentials, cutoffs, grid densities, etc.) via pymatgen. Example 2: if “stability_check” is enabled, the later parts of the workflow are skipped when the structure is determined to be unstable, which saves computer time on uninteresting structures.
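
The two examples above can be sketched as follows. This is a plain-Python illustration under stated assumptions, not atomate or pymatgen code: `INPUT_SETS`, `run_workflow`, `e_above_hull`, and `max_e_above_hull` are hypothetical names, and every number is a placeholder rather than a recommended DFT setting or a computed result.

```python
# Hypothetical rule sets standing in for a "VASP input set".
# The numbers are illustrative placeholders only.
INPUT_SETS = {
    "standard": {"cutoff_eV": 520, "kpoint_density": 1000},
    "fast": {"cutoff_eV": 400, "kpoint_density": 500},
}

# Steps that are only worth running on structures that look stable.
LATER_STEPS = ["static calculation", "band structure"]


def run_workflow(material, input_set="standard", stability_check=False,
                 max_e_above_hull=0.1):
    """Return which steps would run for a toy material record."""
    params = INPUT_SETS[input_set]          # high-level option 1: parameter rules
    executed = ["structure optimization"]
    # A real workflow would compute stability; here it is supplied with the input.
    unstable = material["e_above_hull"] > max_e_above_hull
    if stability_check and unstable:        # high-level option 2: early exit
        return {"params": params, "executed": executed,
                "skipped": list(LATER_STEPS)}
    return {"params": params, "executed": executed + LATER_STEPS, "skipped": []}


toy = {"formula": "X2Y", "e_above_hull": 0.5}   # made-up, clearly unstable
checked = run_workflow(toy, stability_check=True)
unchecked = run_workflow(toy, input_set="fast")
print(checked["skipped"], unchecked["executed"])
```

With the check enabled, only the structure optimization runs for the unstable toy material; with it disabled, every step runs regardless.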
  • 19. You can build workflows from scratch or reuse components to assemble workflows. Multiple workflows are built with the same components stacked together in different ways, like Legos. The two workflows shown share almost all of their code.
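
A minimal sketch of the "Lego" idea, assuming each component is just a function that takes and returns a state dictionary. The step names and the `make_workflow` helper are hypothetical and are not taken from FireWorks or atomate; the two workflows built here are illustrative, not the ones on the slide.

```python
def _record(state, name):
    """Return a copy of the state with one more completed step logged."""
    return {**state, "log": state["log"] + [name]}


# Reusable components: each is a stand-in for a real calculation step.
def optimize(state):
    return _record(state, "optimize")


def static(state):
    return _record(state, "static")


def band_structure(state):
    return _record(state, "band_structure")


def elastic(state):
    return _record(state, "elastic")


def make_workflow(*steps):
    """Stack components into a workflow that runs them in order."""
    def run(state):
        for step in steps:
            state = step(state)
        return state
    return run


# Two workflows assembled from mostly the same pieces.
band_wf = make_workflow(optimize, static, band_structure)
elastic_wf = make_workflow(optimize, static, elastic)

start = {"formula": "GaAs", "log": []}
print(band_wf(start)["log"])
print(elastic_wf(start)["log"])
```

Only the final step differs between the two workflows, so a fix to `optimize` or `static` benefits both.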
  • 20. Software allows you to leverage the prior efforts and knowledge of many researchers, past and present: all past and present knowledge, from everyone in the group, everyone previously in the group, and our collaborators, about how to run calculations. Contributors include: K. Mathew, J. Montoya, S. Dwaraknath, A. Faghaninia, M. Aykol, S.P. Ong, B. Bocklund, T. Smidt, H. Tang, I.H. Chu, M. Horton, J. Dagdelen, B. Wood, Z.K. Liu, J. Neaton, K. Persson, A. Jain.
  • 21. Talk outline • What we did • How we did it • Things that worked for us
  • 22. Things that worked for us (1) – BDFLs • At first, we tried to make every coding decision by committee – e.g., get all the developers to sit in a room and agree on a solution • Later, we assigned a strong BDFL (benevolent dictator for life) to each codebase, who would consider all options but could simply make decisions on behalf of that codebase • We found that, even though the BDFL was not always right, we were able to progress much faster, much better, and surprisingly with much less conflict than under the old committee approach • Note: If you were BDFL of a codebase, you got to do things your way. But you were also signing up for a ton of extra work for that privilege. Thus, BDFLs must care a lot about the code, be very detail-oriented, and be willing to work overtime. Not everyone is a candidate!
  • 23. Things that worked for us (2) – forced collaboration • The tendency for most scientists, at least at first, is to write their own individual scripts in their own corner • At first, we needed a strong authority figure (i.e., the center lead) to force collaboration. – “All code must go in pymatgen!” – Kristin Persson • Once the code builds enough momentum and is big / established enough, forced collaboration can be dropped and researchers naturally put their code there.
  • 24. Things that worked for us (3) – MongoDB • When most people think databases, they think “SQL” – We were also of that mentality from 2006-2011 • We built a beautiful, intricate schema (database blueprint) for simulation data that was a wonder to behold – But only the “database master” really knew how to modify / expand it – Any time a new type of data needed to be included in the database, the “database master” had to design schema updates • A computer science colleague thought we might want to experiment with MongoDB • Result: we can move much faster with MongoDB due to its flexibility and gentle learning curve. – These days, we don’t really use SQL for anything.
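
The schema-flexibility point can be illustrated with the standard library alone. In this sketch, an in-memory SQLite table plays the role of the fixed SQL schema, and a plain Python list of dictionaries stands in for a MongoDB collection (no MongoDB driver is used). The table, field names, and values are hypothetical placeholders, not Materials Project data.

```python
import sqlite3

# --- Fixed schema: a new kind of data requires a schema update first. ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE materials (formula TEXT, band_gap REAL)")
conn.execute("INSERT INTO materials VALUES (?, ?)", ("GaAs", 1.0))  # placeholder value

try:
    conn.execute(
        "INSERT INTO materials (formula, elastic_tensor) VALUES (?, ?)",
        ("Si", "[[0.0]]"),
    )
    sql_needed_migration = False
except sqlite3.OperationalError:
    # The column does not exist yet, so someone has to alter the schema.
    sql_needed_migration = True
    conn.execute("ALTER TABLE materials ADD COLUMN elastic_tensor TEXT")
    conn.execute(
        "INSERT INTO materials (formula, elastic_tensor) VALUES (?, ?)",
        ("Si", "[[0.0]]"),
    )

# --- Document style: each record simply carries whatever fields it has. ---
collection = []  # stand-in for a document collection
collection.append({"formula": "GaAs", "band_gap": 1.0})
collection.append({
    "formula": "Si",
    "band_gap": 1.0,
    "elastic_tensor": [[0.0] * 6 for _ in range(6)],  # new field, no migration
})

with_elastic = [doc["formula"] for doc in collection if "elastic_tensor" in doc]
print(sql_needed_migration, with_elastic)
```

The trade-off is that the document style leaves it to the reading code to check whether a field is present, as the final list comprehension does.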
  • 25. Things that worked for us (4) – day 1 open source • Early in the project, we felt there was commercial and “research advantage” value in all our automation software – “Let’s release open source in the future, when the code is cleaner and we have finished getting our own research mileage out of it” – Materials Project, circa 2011 • One BDFL experimented with day 1 open source for a new and experimental code that rewrote a major, closed-source legacy Java code in Python – That code, pymatgen, grew very quickly and displaced the old legacy code in record time. It’s been cited ~300 times in just 4 years since publication! • Today all our codes are open source from day 1 – Incidentally, if a code is not open source from day 1, we almost never see it become open source. The “clean it up and release as open source later” approach never works for us.
  • 26. Thank you! • Prof. Kristin Persson and Prof. Gerbrand Ceder, founders of Materials Project, and their teams • Prof. Shyue Ping Ong, pymatgen BDFL • NERSC computing center and staff • Funding: U.S. Department of Energy • …and everyone who contributed to these codes! Slides (already) posted to: http://www.slideshare.net/anubhavster