How Web APIs and Data Centric Tools Power the Materials Project (PyData SV 2013)
1. DATA AS A SERVICE
How Web APIs and Data-Centric
Tools Power the Materials Project
Shreyas Cholia (scholia@lbl.gov)
Dan Gunter (dkgunter@lbl.gov)
Lawrence Berkeley National Laboratory
PyData 2013
2. Outline
• Data driven science
• Materials Project Overview
• Open data and APIs
• Dropping APIs on your data
• Things to think about in your API
• Writing libraries for code AND data (pymatgen REST
interface)
• Science stories to back this up
• IPython notebook demo
3. About Us
• Dan and Shreyas are Computer Scientists/Engineers at
Berkeley Lab.
• We work with science teams to help build software and
computing infrastructure that facilitates awesome
SCIENCE
4. Science
• Science is now a collaborative effort
• Large teams of people
• Lots of computational power
6. Big Data
• Science is increasingly data-driven
• Computational cycles are cheap
• Take an –omics approach to science
• Compute all interesting things first, ask questions later
7. The –omics approach
• Instead of trying to derive a solution and compute the
results, just compute the space of all possibilities and look
for the optimal result in there.
• OK – so we are generating more data than we know what
to do with but that is ok
• (and might be a topic for another talk …)
8. The Materials Project
An open science initiative that makes available a huge database of computed
materials properties for all materials researchers.
[Figure: word cloud showing frequencies of elements in Materials Project's
database, except Oxygen, which appears 12,751 times (3.5x as much as the
next most frequent, Phosphorus)]
10. "Need for speed" in new materials
• 18 years from creation to commercial manufacture!
[Figure: timeline, 1950 to 2000, from invention to commercial manufacture for
Teflon, Titanium, Velcro, Polycarbonate, GaAs, Diamond-like Thin Films, and
Lithium ion (S. Whittingham; Sony)]
Materials Data from: Eagar, T.; King, M. Technology Review (00401692) 1995, 98, 42.
11. Materials have strategic importance
2010 "Senkaku Boat Collision Incident"
• Sept 7, 2010: Japan arrests Chinese boat captain after collision in
disputed waters
• Sept 22, 2010: China blocks shipments of Rare Earth Metals to Japan
• Sept 24, 2010: Japan releases captain
• Japan invests in induction motors… coincidence?
“Toyota Readying Motors That Don’t Use Rare Earths…” (Jan 14, 2011 1:50 PM PT)
Content for this slide courtesy Gerbrand Ceder, MIT & Kristin Persson, LBNL
12. Solution: Computation
Many materials properties can be computed
[Figure: computed vs. experimental voltage (V), 0 to 4.5, with literature
staging labels (stage I, stage II, stage III)]
ΔH = [ E(X) + E(Y) ] – E(XY)
Photovoltaics, Thermoelectrics,
Energy Storage, Hydrogen,
Catalysts, Magnets….
14. Computation
• Run VASP on NERSC supercomputing resources
• Use Fireworks to manage large groups of runs
• Results in … data for MP
Just sit back and enjoy the automation..
[Figure: number of runs (Failed, Successful, Total) by date, Jul 2011 to
Oct 2012; vertical axis from 0 to 60,000]
15. Data Islands
• Data is still heavily silo-ed and inaccessible.
• Data sits on a machine somewhere, and you give people
local ssh/DB accounts to access it.
• Good luck combining multiple datasets
• Does not scale!
• This is 2013 – we can do better!
16. Sharing (your data) is important!
• The Most Important Scientific Result Published in the
Last Year
J.M. Wicherts, M. Bakker, and D. Molenaar:
Willingness to Share Research Data Is Related to the
Strength of the Evidence and the Quality of Reporting of
Statistical Results
PLoS ONE, 6(11): e26828, 2011, doi:10.1371/journal.pone.0026828.
Content for this slide courtesy Greg Wilson, Software Carpentry
17. Data Sharing
• Open access to data through programmatic interfaces
• Sub-select the data on demand rather than pulling down
the entire dataset
• Use your own local tools with centrally managed data
• Everyone sees the same data – better collaboration
18. Web portal
• Materials data stored in MongoDB
• http://materialsproject.org web portal makes materials data
easily accessible
• Materials Explorer
• Phase Diagrams
• Crystal Toolkit
• Battery Explorer
• Reaction Calculator
• Structure Predictor
• Focus on a highly functional and usable website to query
materials data. (We heart Django!)
• Additionally we distribute the tools used to compute and
analyze the data as an open source library – pymatgen
19. API access
• But we quickly found that scientists wanted programmatic
access to data
• e.g., give me property X for all materials with Li and O so
that I can pass it through my own codes
• Lesson – make your data available through an API and
people will start to do amazing things
20. Why Web APIs?
• Big push towards HTTP APIs across the web.
• Web APIs provide programmatic access to data and
resources to developers over the web
• Access to data as well-defined objects allows users to
develop their own custom applications and code
Enables a thriving COMMUNITY built around data.
21. What is The Materials API?
• An open platform for accessing Materials Project data over the web.
• Flexible and scalable to cater to a large number of collaborators, with
different access privileges.
• Simple to use and code agnostic.
23. Access via an API key
• To maintain privileged access, each user has an
associated API key (with certain defined access
privileges).
• To get your key, log in to materialsproject.org and go to
www.materialsproject.org/profile
• All MP HTTPS requests must supply the API key as:
• An X-API-KEY header, e.g., {'X-API-KEY': 'MYKEY'}, or
• A GET or POST variable, e.g., {'API_KEY': 'MYKEY'}
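A minimal sketch of the two options, using only the Python standard library. It builds the request objects without sending anything. The helper names request_with_header and url_with_key are hypothetical, MYKEY is a placeholder, and the endpoint is the one from the sample-output slide.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the sample-output slide; MYKEY is a placeholder key.
URL = "https://www.materialsproject.org/rest/v1/materials/Fe2O3/vasp/energy"


def request_with_header(url, api_key):
    """Option 1: supply the key as an X-API-KEY header (hypothetical helper)."""
    return Request(url, headers={"X-API-KEY": api_key})


def url_with_key(url, api_key):
    """Option 2: supply the key as a GET variable (hypothetical helper)."""
    return url + "?" + urlencode({"API_KEY": api_key})


# Nothing is sent over the network here; we only build the request objects.
header_request = request_with_header(URL, "MYKEY")
query_url = url_with_key(URL, "MYKEY")
```

Passing header_request to urllib.request.urlopen would send it. The header form keeps the key out of URLs that may end up in logs or browser history.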
24. Sample JSON output
GET https://www.materialsproject.org/rest/v1/materials/Fe2O3/vasp/energy
{
"created_at": "2013-03-17T09:14:58.158081",
"valid_response": true,
"version": {
"pymatgen": "2.5.4",
"db": "2013.02.25",
"rest": "1.0"
},
"response": [{
"energy": -132.33005625,
"material_id": 542309
}, {
"energy": -66.62512425,
"material_id": 24972
}],
"copyright": "Copyright 2012, The Materials Project"
}
Just the energy and the
id of the material
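A short sketch of consuming this response in Python, with the sample JSON above pasted in as a string so that no network access is needed. The function name energies_by_material is hypothetical.

```python
import json

# The sample response from the slide, embedded as a string.
SAMPLE = """
{
  "created_at": "2013-03-17T09:14:58.158081",
  "valid_response": true,
  "version": {"pymatgen": "2.5.4", "db": "2013.02.25", "rest": "1.0"},
  "response": [
    {"energy": -132.33005625, "material_id": 542309},
    {"energy": -66.62512425, "material_id": 24972}
  ],
  "copyright": "Copyright 2012, The Materials Project"
}
"""


def energies_by_material(payload):
    """Map material_id -> energy, refusing responses flagged as invalid."""
    if not payload["valid_response"]:
        raise ValueError("API reported an invalid response")
    return {item["material_id"]: item["energy"] for item in payload["response"]}


energies = energies_by_material(json.loads(SAMPLE))
# The material id whose entry has the lowest total energy.
lowest_id = min(energies, key=energies.get)
```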
26. Our dirty little secret
• It involves a certain language that ends with “uby” that we
don’t like to talk about in these parts
• Version 0.0.0 of the Materials Project was coded in
Sinatra
• Sinatra is a microframework much like Flask
• But it proves that this approach is viable and can be the
onramp to more amazing things.
27. Un-considerations
• Don’t worry too much about pure REST
• Initially just think of how URLs and verbs can map to functions
• Don’t worry too much about data formats
• JSON is easy and a great place to start
• Feel free to avoid XML unless you really need it
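To make "URLs and verbs map to functions" concrete, here is a framework-free sketch. The route pattern, the get_energy function and the in-memory ENERGIES table are all hypothetical stand-ins; a microframework such as Flask or Sinatra does this dispatching for you.

```python
import json
import re

# Hypothetical in-memory data standing in for a real database.
ENERGIES = {"Fe2O3": [-132.33005625, -66.62512425]}


def get_energy(formula):
    """Plain function holding the actual logic; knows nothing about HTTP."""
    return {"response": [{"energy": e} for e in ENERGIES.get(formula, [])]}


# Each route maps (verb, URL pattern) to a function.
ROUTES = [
    ("GET", re.compile(r"^/materials/(?P<formula>\w+)/energy$"), get_energy),
]


def dispatch(verb, path):
    """Find the first matching route and return its result as JSON text."""
    for route_verb, pattern, handler in ROUTES:
        match = pattern.match(path)
        if verb == route_verb and match:
            return json.dumps(handler(**match.groupdict()))
    return json.dumps({"error": "not found"})


body = dispatch("GET", "/materials/Fe2O3/energy")
```

Named groups in the URL pattern become keyword arguments, so adding an endpoint means writing one function and one route entry.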
29. pymatgen
• The open source Python library that powers the Materials
Project.
• Defines core Python objects for materials data representation.
• Provides a well-tested set of structure and thermodynamic analysis
tools relevant to many applications.
• Establishes an open platform for researchers to collaboratively
develop sophisticated analyses of materials data obtained both
from first principles calculations and experiments.
31. Where we’re going with this
• Libraries that integrate data with computation!
• The scientific python ecosystem has a ton of data analysis
tools and libraries
• Just starting to think about baking in datasets directly into
these tools
• Pymatgen allows you to access core MP data directly
from the library
32. Compute + data
pymatgen has hooks into the materials data so you can do
stuff like this:
entries = api.get_entries_in_chemsys(['Li', 'Fe', 'O'])
But it also has computational tools that you can then use to
act on the data
pd = PhaseDiagram(entries)
33. Blurring the lines
• Yes – we are blurring the lines between compute and data
• But this is not a new idea
• Think of all the tools built around commercial APIs
• Twitter, Netflix, etc.: Python clients built around the API
34. Write First Class Science Functions
• Web APIs are extremely useful, but ultimately you want to
encapsulate core science functionality as Python functions
so that scientists aren't worrying about things like:
"How do I set the X-API-KEY header?"
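A sketch of what such a function might look like. The name get_energies and its fetch parameter are hypothetical and are not pymatgen's API; the URL layout follows the sample request shown earlier and may not match what the service offers today. The HTTP details live in one private helper, and the transport can be swapped out for testing.

```python
import json
from urllib.request import Request, urlopen

# Base URL as shown on the sample-output slide (an assumption about the
# service; it may have changed since the talk).
BASE_URL = "https://www.materialsproject.org/rest/v1"


def _http_get(url, api_key):
    """The only place that knows about headers and HTTP."""
    request = Request(url, headers={"X-API-KEY": api_key})
    with urlopen(request) as response:
        return response.read().decode("utf-8")


def get_energies(formula, api_key, fetch=_http_get):
    """Return the computed energies for a formula as a list of floats.

    `fetch` is injectable so the function can be exercised offline.
    """
    url = f"{BASE_URL}/materials/{formula}/vasp/energy"
    payload = json.loads(fetch(url, api_key))
    if not payload.get("valid_response"):
        raise ValueError(f"invalid response for {formula}")
    return [item["energy"] for item in payload["response"]]


# Typical call (requires a real key and network access):
# energies = get_energies("Fe2O3", "YOUR_API_KEY")
```

The scientist calls get_energies("Fe2O3", key) and gets numbers back without seeing headers, URLs or JSON.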
35. Sample use cases
• Screening for CO2 sorbents (with Clare Grey)
• Using the Materials API (MAPI) + pymatgen to calculate reaction
energies of thousands of oxides with CO2.
• Calculation of XAFS, XANES and other spectra for
clusters of atoms (with Alan Dozier)
• Alan wrote an I/O add-on to pymatgen for FEFF input/output.
• Uses MAPI + pymatgen to extract structures.
• Defects (with Maciej Haranczyk)
• Uses MAPI + pymatgen to pull structures to perform Voronoi
analysis to find possible interstitial sites.
37. Materials API + pymatgen example
from pymatgen.matproj.rest import MPRester
# The slide omits the two imports below; these module paths are an assumption
# based on the pymatgen 2.x layout and may differ in other versions.
from pymatgen.phasediagram.pdmaker import PhaseDiagram
from pymatgen.phasediagram.plotter import PDPlotter

# This initializes the REST adaptor. Put your own API key in.
a = MPRester("YOUR_API_KEY")

# This gives you the Structure corresponding to material id 2254 in
# the Materials Project.
structure = a.get_structure_by_material_id(2254)

# Entries are the basic unit for thermodynamic and other analyses
# in pymatgen.
# This gets all entries belonging to the Ca-C-O system.
entries = a.get_entries_in_chemsys(['Ca', 'C', 'O'])

# With entries, you can do many sophisticated analyses,
# like creating phase diagrams.
pd = PhaseDiagram(entries)
plotter = PDPlotter(pd)
plotter.show()
38. Sandboxes
• A virtual private dataset
• Useful for
• Everyone as a sort of "scratch"
space
• Industry partners who want to use
the tools but not share their data
39. Import format: Structure Notation
Language (SNL)
• Contains structure/molecule object, and provenance
about
created_at
authors
projects
references
remarks
data
history
Another way to remember the acronym..
40. Fireworks
• FireWorks is a code for defining, managing, and executing
scientific workflows
• It can be used to automate most types of calculations over
arbitrary computing resources, including those that have a
queueing system
• It is very dynamic: Fireworks can beget other fireworks at
runtime
http://pythonhosted.org/FireWorks/
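The "dynamic" idea can be illustrated without FireWorks itself. The sketch below is plain Python and does not use the FireWorks API; run_workflow, make_relaxation and analyze are hypothetical names. Each task returns the follow-up tasks it wants run, so the shape of the workflow is only known at runtime.

```python
from collections import deque


def run_workflow(first_task):
    """Run tasks in FIFO order; each task may return follow-up tasks."""
    queue = deque([first_task])
    completed = []
    while queue:
        task = queue.popleft()
        name, follow_ups = task()
        completed.append(name)
        queue.extend(follow_ups)  # new work discovered at runtime
    return completed


def analyze():
    """Final step; spawns nothing further."""
    return "analyze", []


def make_relaxation(attempt, max_attempts=3):
    """Build a relaxation task that re-spawns itself until it 'converges'."""
    def relax():
        converged = attempt >= 2  # stand-in for a real convergence check
        if converged or attempt >= max_attempts:
            return f"relax#{attempt}", [analyze]
        return f"relax#{attempt}", [make_relaxation(attempt + 1, max_attempts)]
    return relax


order = run_workflow(make_relaxation(1))
```

Here the first relaxation does not converge, so it spawns a second attempt, which then hands off to the analysis step.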
41. Pymatgen-db
• Sick of MongoHub et al.? We were. So we wrote a simple
Web UI using prettytable, pymatgen, and Django
• https://github.com/materialsproject/pymatgen-db
Which we
proceeded to
use for deep
scientific inquiry
42. We’re not the only ones …
• Bioinformatics
• KBase (http://kbase.us) - DOE predictive and systems biology.
• Astronomy
• Sloan Digital Sky Survey (http://skyserver.sdss.org)
• Spectroscopy
• Advanced Light Source (ALS), Advanced Photon Source (APS)
• According to ProgrammableWeb, ~130 others
http://www.programmableweb.com/apis/directory/1?apicat=Science&protocol=REST
..though probably many of these are
43. More information
• Materials API + pymatgen examples
• https://gist.github.com/gists/search?q=materials+api+pymatgen
• The Materials API wiki
• https://materialsproject.org/wiki/index.php/The_Materials_API
• Python Materials Genomics
• http://packages.python.org/pymatgen/
• Shyue Ping Ong, William Davidson Richards, Anubhav Jain, Geoffroy
Hautier, Michael Kocher, Shreyas Cholia, Dan Gunter, Vincent
Chevrier, Kristin A. Persson, Gerbrand Ceder. Python Materials
Genomics (pymatgen) : A Robust, Open-Source Python Library for
Materials Analysis. (submitted)
• These slides:
• https://speakerdeck.com/shreddd/data-as-a-service-pydata-2013
44. Takeaways
• Make scientific data easily available to end-users
• Friendly, powerful Web UI is a great way to engage, but then..
• Build APIs around your data to make it easily accessible
• Write scientific libraries with *both* analysis and data, by
hooking them up to APIs.
45. We’re hiring
• Talented, science-loving, web-savvy, math-anything
Python programming code-slingers who would rather pass
a Nobel prize winner on the way to lunch than get free
dry-cleaning
• downside: or even free coffee (groan)
• upside: some of your tax dollars go towards your own salary!
• http://jobs.materialsproject.org/
46. Contact Us
• Shreyas Cholia – scholia@lbl.gov
• Dan Gunter – dkgunter@lbl.gov
• Materials Project Team – feedback@materialsproject.org