DATA AS A SERVICE
How Web APIs and Data-Centric
Tools Power the Materials Project
Shreyas Cholia (scholia@lbl.gov)
Dan Gunter (dkgunter@lbl.gov)
Lawrence Berkeley National Laboratory
PyData 2013
Outline
•  Data driven science
•  Materials Project Overview
•  Open data and APIs
•  Dropping APIs on your data
•  Things to think about in your API
•  Writing libraries for code AND data (pymatgen REST
interface)
•  Science stories to back this up
•  IPython notebook demo
About Us
•  Dan and Shreyas are Computer Scientists/Engineers at
Berkeley Lab.
•  We work with science teams to help build software and
computing infrastructure that facilitates awesome
SCIENCE
Science
•  Science is now a collaborative effort
•  Large teams of people
•  Lots of computational power
The Fourth Paradigm
Big Data
Science is
increasingly data-
driven
Computational
cycles are cheap
Take an –omics
approach to
science
Compute all
interesting things
first, ask questions
later
The –omics approach
•  Instead of trying to derive a solution and compute the
results, just compute the space of all possibilities and look
for the optimal result in there.
•  OK – so we are generating more data than we know what
to do with but that is ok
•  (and might be a topic for another talk …)
An open science initiative that makes available
a huge database of computed materials
properties for all materials researchers.
The Materials Project
Wordcloud showing
frequencies of elements
in Materials Project's
database
..except Oxygen, which appears
12,751 times (3.5x as much as the
next most frequent, Phosphorus)
The Materials Project http://materialsproject.org/
"Need for speed" in new materials
18 years from creation to commercial manufacture!
•  Teflon
•  Titanium
•  Velcro
•  Polycarbonate
•  GaAs
•  Diamond-like Thin Films
•  Lithium ion
[Figure: timeline from 1950 to 2000; labels: invented; Lithium ion, S. Whittingham, Sony]
Materials Data from: Eagar, T.; King, M. Technology Review (00401692) 1995, 98, 42.
Materials have strategic importance
Sept 7, 2010
Japan arrests
Chinese boat captain
after collision in
disputed waters
China blocks
shipments of Rare
Earth Metals to
Japan
Sept 22, 2010
Japan releases
captain
Sept 24, 2010
Japan invests in induction motors… coincidence?
“Toyota Readying Motors That Don’t Use Rare Earths…”
Jan 14, 2011 1:50 PM PT
Content for this slide courtesy Gerbrand Ceder, MIT & Kristin Persson, LBNL
2010 "Senkaku Boat Collision Incident"
Solution: Computation
Many materials properties can be computed
[Figure: computed vs. experimental (literature) voltage, in V, axis from 0 to 4.5; stage labels: I, II, I+II, III+II]
ΔH = [E(X) + E(Y)] – E(XY)
Photovoltaics, Thermoelectrics,
Energy Storage, Hydrogen,
Catalysts, Magnets….
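The expression on this slide, ΔH = [E(X) + E(Y)] – E(XY), can be evaluated directly once the three energies are known. The sketch below only shows the arithmetic; the function name and the input values are hypothetical placeholders, not Materials Project data.

```python
def reaction_enthalpy(e_x, e_y, e_xy):
    """Return dH = [E(X) + E(Y)] - E(XY), in the same units as the inputs."""
    return (e_x + e_y) - e_xy


# Hypothetical total energies in eV, chosen only to show the arithmetic.
dh = reaction_enthalpy(e_x=-3.0, e_y=-5.0, e_xy=-9.5)
print(dh)  # 1.5
```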
Infrastructure
Submitted
Materials
Materials
Data
Materials Properties
Supercomputers	
  
•  Over 10 million CPU hours of
calculations in < 6 months
•  Over 40,000 successful VASP
runs
(30,000+ materials)
•  Generalizable to other high-
throughput codes/analyses
Calculation
Workflows
Supercomputers Codes to run
(in sequence)
Atomic positions
[Figure: number of runs by date, Jul 2011 to Oct 2012, by state (Failed, Successful, Total); axis from 0 to 60,000]
Computation
•  Run VASP on NERSC
supercomputing
resources
•  Use Fireworks to
manage large groups
of runs
•  Results in … data for
MP
Just sit back and
enjoy the
automation..
Data Islands
•  Data is still heavily siloed and inaccessible.
•  Data sits on a machine somewhere, and you give people
local ssh/DB accounts to access it.
•  Good luck combining multiple datasets
•  Does not scale!
•  This is 2013 – we can do better!
Sharing (your data) is important!
•  The Most Important Scientific Result Published in the
Last Year
J.M. Wicherts, M. Bakker, and D. Molenaar:
Willingness to Share Research Data Is Related to the
Strength of the Evidence and the Quality of Reporting of
Statistical Results
PLoS ONE, 6(11): e26828, 2011, doi:10.1371/journal.pone.0026828.
Content for this slide courtesy Greg Wilson, Software Carpentry
Data Sharing
•  Open access to data through programmatic interfaces
•  Sub-select the data on demand rather than pulling down
the entire dataset
•  Use your own local tools with centrally managed data
•  Everyone sees the same data – better collaboration
Web portal
•  Materials data stored in MongoDB
•  http://materialsproject.org web portal makes materials data
easily accessible
•  Materials Explorer
•  Phase Diagrams
•  Crystal Toolkit
•  Battery Explorer
•  Reaction Calculator
•  Structure Predictor
•  Focus on a highly functional and usable website to query
materials data. (We heart Django!)
•  Additionally we distribute the tools used to compute and
analyze the data as an open source library – pymatgen
API access
•  But we quickly found that scientists wanted programmatic
access to data
•  E.g., give me property X for all materials containing Li and O so
that I can pass it through my own codes
•  Lesson – make your data available through an API and
people will start to do amazing things
Why Web APIs?
•  Big push towards HTTP APIs across the web.
•  Web APIs provide programmatic access to data and
resources to developers over the web
•  Access to data as well-defined objects allows users to
develop their own custom applications and code
Enables a thriving COMMUNITY built around data.
What is The Materials API?
An open platform for
accessing Materials
Project data over the
web.
Flexible and scalable to
cater to large number
of collaborators, with
different access
privileges.
Simple to use and code
agnostic.
HTTP API design
https://www.materialsproject.org/rest/v1/materials/Fe2O3/vasp/energy
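URL parts, left to right:
•  Preamble URL: https://www.materialsproject.org/rest/v1/materials
•  Unique identifier: e.g. a formula (Fe2O3), id (1234) or chemical system (Li-Fe-O)
•  Data type: vasp, exp, etc.
•  Property: energy in the URL above
Materials API maps URLs to data objects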
Access via an API key
•  To maintain privileged access, each user has an
associated API key (with certain defined access
privileges).
•  To get your key, login to materialsproject.org and go to
www.materialsproject.org/profile
•  All MP https requests must supply the API key as:
•  An X-API-KEY header, e.g., {'X-API-KEY': 'MYKEY'}, or
•  A GET or POST variable, e.g., {'API_KEY': 'MYKEY'}
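A minimal sketch of the two slides above, using only the Python standard library: it assembles the URL from its parts (preamble, identifier, data type, property) and attaches the key as an X-API-KEY header. The helper names build_url and build_request are hypothetical, and MYKEY is a placeholder. Nothing is sent over the network unless the request is passed to urlopen.

```python
from urllib.parse import quote
from urllib.request import Request

PREAMBLE = "https://www.materialsproject.org/rest/v1/materials"


def build_url(identifier, data_type, prop):
    """Join preamble, identifier, data type and property into one URL."""
    parts = [quote(str(p), safe="") for p in (identifier, data_type, prop)]
    return "/".join([PREAMBLE] + parts)


def build_request(identifier, data_type, prop, api_key):
    """Return a GET request carrying the API key in the X-API-KEY header."""
    return Request(build_url(identifier, data_type, prop),
                   headers={"X-API-KEY": api_key})


req = build_request("Fe2O3", "vasp", "energy", "MYKEY")
print(req.full_url)
# To actually send it: urllib.request.urlopen(req).read()
```

Keeping URL construction separate from the network call makes the mapping from data object to URL easy to check by eye.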
Sample JSON output
GET https://www.materialsproject.org/rest/v1/materials/Fe2O3/vasp/energy
{
"created_at": "2013-03-17T09:14:58.158081",
"valid_response": true,
"version": {
"pymatgen": "2.5.4",
"db": "2013.02.25",
"rest": "1.0"
},
"response": [{
"energy": -132.33005625,
"material_id": 542309
}, {
"energy": -66.62512425,
"material_id": 24972
}],
"copyright": "Copyright 2012, The Materials Project"
}
Just the energy and the
id of the material
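A small sketch of consuming a response shaped like the sample above: check valid_response, then pick the entry with the lowest energy value. The payload is an abridged copy of the sample, and lowest_energy is a hypothetical helper name.

```python
import json

SAMPLE = """
{
  "valid_response": true,
  "response": [
    {"energy": -132.33005625, "material_id": 542309},
    {"energy": -66.62512425, "material_id": 24972}
  ]
}
"""


def lowest_energy(payload):
    """Return the response entry with the lowest energy value."""
    doc = json.loads(payload)
    if not doc.get("valid_response"):
        raise ValueError("API reported an invalid response")
    return min(doc["response"], key=lambda entry: entry["energy"])


best = lowest_energy(SAMPLE)
print(best["material_id"], best["energy"])
```

Whether the raw values are directly comparable depends on how the energies are normalized; the comparison here only illustrates handling the structure of the response.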
Getting started – Hello World API
> pip install flask
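A minimal sketch of what such a hello-world data API could look like in Flask. The route mirrors the Materials API URL layout shown earlier; the in-memory ENERGIES table and the handler name are hypothetical stand-ins for a real database query.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical in-memory stand-in for a real database.
ENERGIES = {
    "Fe2O3": [
        {"material_id": 542309, "energy": -132.33005625},
        {"material_id": 24972, "energy": -66.62512425},
    ],
}


@app.route("/rest/v1/materials/<identifier>/vasp/energy")
def energy(identifier):
    """Map the URL to a data object and return it as JSON."""
    records = ENERGIES.get(identifier)
    if records is None:
        return jsonify(valid_response=False, response=[]), 404
    return jsonify(valid_response=True, response=records)


# To serve locally, call app.run(debug=True) or use the `flask run` command.
```

One function per URL pattern is enough to get started; authentication and a real data store can be added later.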
Our dirty little secret
•  It involves a certain language that ends with “uby” that we
don’t like to talk about in these parts
•  Version 0.0.0 of the Materials Project was coded in Sinatra
•  Sinatra is a microframework much like Flask
•  But it proves that this approach is viable and can be the
onramp to more amazing things.
Un-considerations
•  Don’t worry too much about pure REST
•  Initially just think of how URLs and verbs can map to functions
•  Don’t worry too much about data formats
•  JSON is easy and a great place to start
•  Feel free to avoid XML unless you really need it
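The advice to think of URLs and verbs as mapping to functions can be sketched without any framework. Everything below (the handlers, the route table, the "<id>" placeholder syntax, the returned values) is hypothetical and only illustrates the idea.

```python
import json


def get_energy(identifier):
    """Handler for GET materials/<id>/energy (placeholder data)."""
    return {"identifier": identifier, "energy": -1.0}


def list_materials():
    """Handler for GET materials."""
    return {"materials": ["Fe2O3", "LiFePO4"]}


# (verb, URL template) -> function; "<id>" marks a path parameter.
ROUTES = {
    ("GET", "materials"): list_materials,
    ("GET", "materials/<id>/energy"): get_energy,
}


def dispatch(verb, path):
    """Find the function for a verb and path, call it, return JSON text."""
    segments = path.strip("/").split("/")
    for (route_verb, template), func in ROUTES.items():
        pattern = template.split("/")
        if route_verb != verb or len(pattern) != len(segments):
            continue
        args = []
        for want, got in zip(pattern, segments):
            if want == "<id>":
                args.append(got)
            elif want != got:
                break
        else:
            # Every segment matched, so call the handler.
            return json.dumps(func(*args))
    raise LookupError(f"no handler for {verb} {path}")


print(dispatch("GET", "/materials/Fe2O3/energy"))
```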
Our Stack
•  Apache + mod_wsgi
•  Django
•  pymatgen
•  pymongo + MongoDB
pymatgen
•  The open source Python library that powers the Materials
Project.
•  Defines core Python objects for materials data representation.
•  Provides a well-tested set of structure and thermodynamic analysis
tools relevant to many applications.
•  Establishes an open platform for researchers to collaboratively
develop sophisticated analyses of materials data obtained both
from first principles calculations and experiments.
Integration with pymatgen
The Materials API
Powerful Materials
Analytics Tool
Where we’re going with this
•  Libraries that integrate data with computation!
•  The scientific python ecosystem has a ton of data analysis
tools and libraries
•  Just starting to think about baking in datasets directly into
these tools
•  Pymatgen allows you to access core MP data directly
from the library
Compute + data
pymatgen has hooks into the materials data so you can do
stuff like this:
entries = api.get_entries_in_chemsys(['Li', 'Fe', 'O'])
But it also has computational tools that you can then use to
act on the data
pd = PhaseDiagram(entries)
Blurring the lines
•  Yes – we are blurring the lines between compute and data
•  But this is not a new idea
•  Think of all the tools built around commercial APIs
•  Twitter, Netflix etc. - python clients built around the API
Write First Class Science Functions
•  Web APIs are extremely useful, but ultimately you want to
encapsulate core science functionality as python functions
so that scientists aren’t worrying about things like
How do I set the
X-API-KEY header?
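A sketch of hiding the HTTP details behind a plain Python function. MaterialsClient is a hypothetical thin wrapper, not pymatgen's MPRester; the transport is injectable so the class can be tried offline with a canned reply.

```python
import json
from urllib.request import Request, urlopen

BASE = "https://www.materialsproject.org/rest/v1/materials"


def _http_get(url, api_key):
    """Default transport: GET the URL with the key in the X-API-KEY header."""
    with urlopen(Request(url, headers={"X-API-KEY": api_key})) as resp:
        return resp.read().decode("utf-8")


class MaterialsClient:
    """Hypothetical thin client; not pymatgen's MPRester."""

    def __init__(self, api_key, fetch=_http_get):
        self.api_key = api_key
        self._fetch = fetch  # injectable so the class can be tried offline

    def get_energies(self, formula):
        """Return {material_id: energy} for every entry matching formula."""
        url = f"{BASE}/{formula}/vasp/energy"
        doc = json.loads(self._fetch(url, self.api_key))
        if not doc.get("valid_response"):
            raise ValueError(f"invalid response for {formula}")
        return {e["material_id"]: e["energy"] for e in doc["response"]}


# Offline demonstration with a canned reply instead of a network call.
def fake_fetch(url, api_key):
    return json.dumps({
        "valid_response": True,
        "response": [{"material_id": 24972, "energy": -66.62512425}],
    })


client = MaterialsClient("MYKEY", fetch=fake_fetch)
print(client.get_energies("Fe2O3"))
```

The caller only sees get_energies("Fe2O3"); the header, URL layout and JSON parsing stay inside the library.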
Sample use cases
•  Screening for CO2 sorbents (with Clare Grey)
•  Using the Materials API (MAPI) + pymatgen to calculate reaction
energies of thousands of oxides with CO2.
•  Calculation of XAFS, XANES and other spectra for
clusters of atoms (with Alan Dozier)
•  Alan wrote an I/O add-on to pymatgen for FEFF input/output.
•  Uses MAPI + pymatgen to extract structures.
•  Defects (with Maciej Haranczyk)
•  Uses MAPI + pymatgen to pull structures to perform Voronoi
analysis to find possible interstitial sites.
IPython Notebook Examples
•  http://nbviewer.ipython.org/5199610
•  http://nbviewer.ipython.org/5022735
from pymatgen.matproj.rest import MPRester
# PhaseDiagram and PDPlotter imports: module paths assumed from pymatgen 2.x
# and may differ in other releases.
from pymatgen.phasediagram.pdmaker import PhaseDiagram
from pymatgen.phasediagram.plotter import PDPlotter

# This initializes the REST adaptor. Put your own API key in.
a = MPRester("YOUR_API_KEY")

# This gives you the Structure corresponding to material id 2254 in
# the Materials Project.
structure = a.get_structure_by_material_id(2254)

# Entries are the basic unit for thermodynamic and other analyses
# in pymatgen.
# This gets all entries belonging to the Ca-C-O system.
entries = a.get_entries_in_chemsys(['Ca', 'C', 'O'])

# With entries, you can do many sophisticated analyses,
# like creating phase diagrams.
pd = PhaseDiagram(entries)
plotter = PDPlotter(pd)
plotter.show()
Materials API + pymatgen example
Sandboxes
•  A virtual private dataset
•  Useful for
•  Everyone as a sort of "scratch"
space
•  Industry partners who want to use
the tools but not share their data
Import format: Structure Notation
Language (SNL)
•  Contains structure/molecule object, and provenance
about
created_at
authors
projects
references
remarks
data
history
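As a rough illustration, an SNL-style record can be pictured as a structure plus an "about" block carrying the provenance fields listed above. The make_snl helper, the value types and the example structure below are assumptions for illustration; this is not pymatgen's own SNL class.

```python
import json
from datetime import datetime, timezone


def make_snl(structure, authors, projects=None, references="",
             remarks=None, data=None, history=None):
    """Bundle a structure with provenance under the slide's field names."""
    return {
        "structure": structure,
        "about": {
            "created_at": datetime.now(timezone.utc).isoformat(),
            "authors": authors,
            "projects": projects or [],
            "references": references,
            "remarks": remarks or [],
            "data": data or {},
            "history": history or [],
        },
    }


# Hypothetical example record; the structure payload is a placeholder.
snl = make_snl(
    structure={"formula": "Fe2O3"},
    authors=[{"name": "A. Researcher", "email": "a.researcher@example.org"}],
    remarks=["scratch calculation"],
)
print(json.dumps(snl, indent=2))
```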
Another way to remember the acronym..
Fireworks
•  FireWorks is a code for defining, managing, and executing
scientific workflows
•  It can be used to automate most types of calculations over
arbitrary computing resources, including those that have a
queueing system
•  It is very dynamic: Fireworks can beget other fireworks at
runtime
http://pythonhosted.org/FireWorks/
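The idea of tasks that create further tasks at runtime can be sketched in plain Python. This is not the FireWorks API; run_workflow and the step names are hypothetical and only illustrate a dynamic workflow queue.

```python
from collections import deque


def run_workflow(initial_tasks):
    """Run tasks in FIFO order; a task may return new tasks to enqueue."""
    queue = deque(initial_tasks)
    log = []
    while queue:
        name, func = queue.popleft()
        log.append(name)
        # A task returns a list of (name, function) pairs, or None.
        queue.extend(func() or [])
    return log


# Hypothetical steps: a relaxation that, once done, spawns a follow-up run.
def static_run():
    return None  # leaf task, spawns nothing


def relax():
    return [("static", static_run)]


print(run_workflow([("relax", relax)]))  # ['relax', 'static']
```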
Pymatgen-db
•  Sick of MongoHub et al.? We were. So we wrote a simple
Web UI using prettytable, pymatgen, and Django
•  https://github.com/materialsproject/pymatgen-db
Which we
proceeded to
use for deep
scientific inquiry
We’re not the only ones …
•  Bioinformatics
•  KBase (http://kbase.us) - DOE predictive and systems biology.
•  Astronomy
•  Sloan Digital Sky Survey (http://skyserver.sdss.org)
•  Spectroscopy
•  Advanced Light Source (ALS), Advanced Photon Source (APS)
•  According to ProgrammableWeb, ~130 others
http://www.programmableweb.com/apis/directory/1?apicat=Science&protocol=REST
..though probably many of these are
More information
•  Materials API + pymatgen examples
•  https://gist.github.com/gists/search?q=materials+api+pymatgen
•  The Materials API wiki
•  https://materialsproject.org/wiki/index.php/The_Materials_API
•  Python Materials Genomics
•  http://packages.python.org/pymatgen/
•  Shyue Ping Ong, William Davidson Richards, Anubhav Jain, Geoffroy
Hautier, Michael Kocher, Shreyas Cholia, Dan Gunter, Vincent
Chevrier, Kristin A. Persson, Gerbrand Ceder. Python Materials
Genomics (pymatgen) : A Robust, Open-Source Python Library for
Materials Analysis. (submitted)
•  These slides:
•  https://speakerdeck.com/shreddd/data-as-a-service-pydata-2013
Takeaways
•  Make scientific data easily available to end-users
•  Friendly, powerful Web UI is a great way to engage, but then..
•  Build APIs around your data to make it easily accessible
•  Write scientific libraries with *both* analysis and data, by
hooking them up to APIs.
We’re hiring
•  Talented, science-loving, web-savvy, math-anything
Python programming code-slingers who would rather pass
a Nobel prize winner on the way to lunch than get free
dry-cleaning
•  downside: or even free coffee (groan)
•  upside: some of your tax dollars go towards your own salary!
•  http://jobs.materialsproject.org/
Contact Us
•  Shreyas Cholia – scholia@lbl.gov
•  Dan Gunter – dkgunter@lbl.gov
•  Materials Project Team – feedback@materialsproject.org
How Web APIs and Data Centric Tools Power the Materials Project (PyData SV 2013)

Más contenido relacionado

La actualidad más candente

Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future TensePaco Nathan
 
Webinar: Deep Learning with H2O
Webinar: Deep Learning with H2OWebinar: Deep Learning with H2O
Webinar: Deep Learning with H2OSri Ambati
 
Scalable Machine Learning in R and Python with H2O
Scalable Machine Learning in R and Python with H2OScalable Machine Learning in R and Python with H2O
Scalable Machine Learning in R and Python with H2OSri Ambati
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesPaco Nathan
 
Intro to H2O in Python - Data Science LA
Intro to H2O in Python - Data Science LAIntro to H2O in Python - Data Science LA
Intro to H2O in Python - Data Science LASri Ambati
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and OutTravis Oliphant
 
Sparkling Water 5 28-14
Sparkling Water 5 28-14Sparkling Water 5 28-14
Sparkling Water 5 28-14Sri Ambati
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusPaco Nathan
 
Deep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoDeep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoSri Ambati
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in RAnqi Fu
 
ArnoCandelAIFrontiers011217
ArnoCandelAIFrontiers011217ArnoCandelAIFrontiers011217
ArnoCandelAIFrontiers011217Sri Ambati
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving UpPaco Nathan
 
Stacked Ensembles in H2O
Stacked Ensembles in H2OStacked Ensembles in H2O
Stacked Ensembles in H2OSri Ambati
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataPaco Nathan
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Sparkelephantscale
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataPaco Nathan
 
PyData Barcelona Keynote
PyData Barcelona KeynotePyData Barcelona Keynote
PyData Barcelona KeynoteTravis Oliphant
 

La actualidad más candente (20)

Data Science in Future Tense
Data Science in Future TenseData Science in Future Tense
Data Science in Future Tense
 
Webinar: Deep Learning with H2O
Webinar: Deep Learning with H2OWebinar: Deep Learning with H2O
Webinar: Deep Learning with H2O
 
Scalable Machine Learning in R and Python with H2O
Scalable Machine Learning in R and Python with H2OScalable Machine Learning in R and Python with H2O
Scalable Machine Learning in R and Python with H2O
 
AI Development with H2O.ai
AI Development with H2O.aiAI Development with H2O.ai
AI Development with H2O.ai
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
 
Intro to H2O in Python - Data Science LA
Intro to H2O in Python - Data Science LAIntro to H2O in Python - Data Science LA
Intro to H2O in Python - Data Science LA
 
Scaling PyData Up and Out
Scaling PyData Up and OutScaling PyData Up and Out
Scaling PyData Up and Out
 
Sparkling Water 5 28-14
Sparkling Water 5 28-14Sparkling Water 5 28-14
Sparkling Water 5 28-14
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and Erasmus
 
Deep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoDeep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry Larko
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
ArnoCandelAIFrontiers011217
ArnoCandelAIFrontiers011217ArnoCandelAIFrontiers011217
ArnoCandelAIFrontiers011217
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
 
Stacked Ensembles in H2O
Stacked Ensembles in H2OStacked Ensembles in H2O
Stacked Ensembles in H2O
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About Data
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Spark
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
PyData Barcelona Keynote
PyData Barcelona KeynotePyData Barcelona Keynote
PyData Barcelona Keynote
 

Similar a How Web APIs and Data Centric Tools Power the Materials Project (PyData SV 2013)

Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningAnubhav Jain
 
Materials Project computation and database infrastructure
Materials Project computation and database infrastructureMaterials Project computation and database infrastructure
Materials Project computation and database infrastructureAnubhav Jain
 
The Materials Project: Experiences from running a million computational scien...
The Materials Project: Experiences from running a million computational scien...The Materials Project: Experiences from running a million computational scien...
The Materials Project: Experiences from running a million computational scien...Anubhav Jain
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonDatabricks
 
Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...Anubhav Jain
 
Big(ger) Data in Software Engineering
Big(ger) Data in Software EngineeringBig(ger) Data in Software Engineering
Big(ger) Data in Software EngineeringMehdi Mirakhorli
 
Accelerating New Materials Design with Supercomputing and Machine Learning
Accelerating New Materials Design with Supercomputing and Machine LearningAccelerating New Materials Design with Supercomputing and Machine Learning
Accelerating New Materials Design with Supercomputing and Machine LearningAnubhav Jain
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesDatabricks
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and PythonTravis Oliphant
 
2016 05 sanger
2016 05 sanger2016 05 sanger
2016 05 sangerChris Dwan
 
Etl with apache impala by athemaster
Etl with apache impala by athemasterEtl with apache impala by athemaster
Etl with apache impala by athemasterAthemaster Co., Ltd.
 
Open Data and Web API
Open Data and Web APIOpen Data and Web API
Open Data and Web APISammy Fung
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoSpark Summit
 
Delivering Agile Data Science on Openshift - Red Hat Summit 2019
Delivering Agile Data Science on Openshift  - Red Hat Summit 2019Delivering Agile Data Science on Openshift  - Red Hat Summit 2019
Delivering Agile Data Science on Openshift - Red Hat Summit 2019John Archer
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsDataWorks Summit
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSSri Ambati
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonDremio Corporation
 
Nyc web perf-final-july-23
Nyc web perf-final-july-23Nyc web perf-final-july-23
Nyc web perf-final-july-23Dan Boutin
 
From SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the SwitchFrom SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the SwitchRachel Berryman
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdbjixuan1989
 

Similar a How Web APIs and Data Centric Tools Power the Materials Project (PyData SV 2013) (20)

Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
Materials Project computation and database infrastructure
Materials Project computation and database infrastructureMaterials Project computation and database infrastructure
Materials Project computation and database infrastructure
 
The Materials Project: Experiences from running a million computational scien...
The Materials Project: Experiences from running a million computational scien...The Materials Project: Experiences from running a million computational scien...
The Materials Project: Experiences from running a million computational scien...
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science London
 
Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...
 
Big(ger) Data in Software Engineering
Big(ger) Data in Software EngineeringBig(ger) Data in Software Engineering
Big(ger) Data in Software Engineering
 
Accelerating New Materials Design with Supercomputing and Machine Learning
Accelerating New Materials Design with Supercomputing and Machine LearningAccelerating New Materials Design with Supercomputing and Machine Learning
Accelerating New Materials Design with Supercomputing and Machine Learning
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean Downes
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
2016 05 sanger
2016 05 sanger2016 05 sanger
2016 05 sanger
 
Etl with apache impala by athemaster
Etl with apache impala by athemasterEtl with apache impala by athemaster
Etl with apache impala by athemaster
 
Open Data and Web API
Open Data and Web APIOpen Data and Web API
Open Data and Web API
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
 
Delivering Agile Data Science on Openshift - Red Hat Summit 2019
Delivering Agile Data Science on Openshift  - Red Hat Summit 2019Delivering Agile Data Science on Openshift  - Red Hat Summit 2019
Delivering Agile Data Science on Openshift - Red Hat Summit 2019
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics Applications
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWS
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
 
Nyc web perf-final-july-23
Nyc web perf-final-july-23Nyc web perf-final-july-23
Nyc web perf-final-july-23
 
From SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the SwitchFrom SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the Switch
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdb
 

Más de PyData

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...PyData
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshPyData
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiPyData
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...PyData
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerPyData
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaPyData
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...PyData
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
How Web APIs and Data Centric Tools Power the Materials Project (PyData SV 2013)

  • 1. DATA AS A SERVICE How Web APIs and Data-Centric Tools Power the Materials Project Shreyas Cholia (scholia@lbl.gov) Dan Gunter (dkgunter@lbl.gov) Lawrence Berkeley National Laboratory PyData 2013
  • 2. Outline •  Data-driven science •  Materials Project Overview •  Open data and APIs •  Dropping APIs on your data •  Things to think about in your API •  Writing libraries for code AND data (pymatgen REST interface) •  Science stories to back this up •  IPython notebook demo
  • 3. About Us •  Dan and Shreyas are Computer Scientists/Engineers at Berkeley Lab. •  We work with science teams to help build software and computing infrastructure that facilitates awesome SCIENCE
  • 4. Science •  Science is now a collaborative effort •  Large teams of people •  Lots of computational power
  • 6. Big Data •  Science is increasingly data-driven •  Computational cycles are cheap •  Take an –omics approach to science •  Compute all interesting things first, ask questions later
  • 7. The –omics approach •  Instead of trying to derive a solution and compute the results, just compute the space of all possibilities and look for the optimal result in there. •  OK – so we are generating more data than we know what to do with but that is ok •  (and might be a topic for another talk …)
  • 8. The Materials Project •  An open science initiative that makes available a huge database of computed materials properties for all materials researchers. [Figure: word cloud showing frequencies of elements in the Materials Project's database, except oxygen, which appears 12,751 times (3.5x as often as the next most frequent, phosphorus).]
  • 9. The Materials Project http://materialsproject.org/
  • 10. "Need for speed" in new materials •  18 years from creation to commercial manufacture! [Figure: timeline, 1950 to 2000, of when materials were invented: Teflon, Titanium, Velcro, Polycarbonate, GaAs, Diamond-like Thin Films, Lithium ion (S. Whittingham, Sony).] Materials data from: Eagar, T.; King, M. Technology Review (00401692) 1995, 98, 42.
  • 11. Materials have strategic importance Sept 7, 2010 Japan arrests Chinese boat captain after collision in disputed waters China blocks shipments of Rare Earth Metals to Japan Sept 22, 2010 Japan releases captain Sept 24, 2010 Japan invests in induction motors… coincidence? “Toyota Readying Motors That Don’t Use Rare Earths…” Jan 14, 2011 1:50 PM PT Content for this slide courtesy Gerbrand Ceder, MIT & Kristin Persson, LBNL 2010 "Senkaku Boat Collision Incident"
  • 12. Solution: Computation •  Many materials properties can be computed •  ΔH = [ E(X) + E(Y) ] – E(XY) •  Photovoltaics, Thermoelectrics, Energy Storage, Hydrogen, Catalysts, Magnets…. [Figure: voltage (V), computed vs. experimental (literature), axis 0 to 4.5, with stage labels I, II and III.]
  • 13. Infrastructure •  Over 10 million CPU hours of calculations in < 6 months •  Over 40,000 successful VASP runs (30,000+ materials) •  Generalizable to other high-throughput codes/analyses [Diagram labels: Submitted Materials, Materials Data, Materials Properties, Calculation Workflows, Supercomputers, Codes to run (in sequence), Atomic positions.]
  • 14. Computation •  Run VASP on NERSC supercomputing resources •  Use Fireworks to manage large groups of runs •  Results in … data for MP •  Just sit back and enjoy the automation.. [Figure: number of runs by state (Failed, Successful, Total) over time, Jul 2011 to Oct 2012; y-axis 0 to 60,000.]
  • 15. Data Islands •  Data is still heavily siloed and inaccessible. •  Data sits on a machine somewhere, and you give people local ssh/DB accounts to access it. •  Good luck combining multiple datasets •  Does not scale! •  This is 2013 – we can do better!
  • 16. Sharing (your data) is important! •  The Most Important Scientific Result Published in the Last Year J.M. Wicherts, M. Bakker, and D. Molenaar: Willingness to Share Research Data Is Related to the Strength of the Evidence and the Quality of Reporting of Statistical Results PLoS ONE, 6(11): e26828, 2011, doi:10.1371/journal.pone.0026828. Content for this slide courtesy Greg Wilson, Software Carpentry
  • 17. Data Sharing •  Open access to data through programmatic interfaces •  Sub-select the data on demand rather than pulling down the entire dataset •  Use your own local tools with centrally managed data •  Everyone sees the same data – better collaboration
  • 18. Web portal •  Materials data stored in MongoDB •  http://materialsproject.org web portal makes materials data easily accessible •  Materials Explorer •  Phase Diagrams •  Crystal Toolkit •  Battery Explorer •  Reaction Calculator •  Structure Predictor •  Focus on a highly functional and usable website to query materials data. (We heart Django!) •  Additionally we distribute the tools used to compute and analyze the data as an open source library – pymatgen
  • 19. API access •  But we quickly found that scientists wanted programmatic access to data •  e.g., "Give me property X for all materials with Li and O so that I can pass it through my own codes" •  Lesson – make your data available through an API and people will start to do amazing things
  • 20. Why Web APIs? •  Big push towards HTTP APIs across the web. •  Web APIs provide programmatic access to data and resources to developers over the web •  Access to data as well-defined objects allows users to develop their own custom applications and code Enables a thriving COMMUNITY built around data.
  • 21. What is The Materials API? •  An open platform for accessing Materials Project data over the web. •  Flexible and scalable to cater to a large number of collaborators, with different access privileges. •  Simple to use and code agnostic.
  • 22. HTTP API design https://www.materialsproject.org/rest/v1/materials/Fe2O3/vasp/energy •  URL parts, in order: preamble; unique identifier, e.g. a formula (Fe2O3), id (1234) or chemical system (Li-Fe-O); data type (vasp, exp, etc.); property •  Materials API maps URLs to data objects
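The URL layout on this slide can be sketched as a small helper. This is only an illustration: `materials_url` and `BASE_URL` are hypothetical names, and the split between "preamble" and the remaining parts is an assumption read off the example URL, not something the slides spell out.

```python
# Assumed preamble, taken from the example URL on the slide.
BASE_URL = "https://www.materialsproject.org/rest/v1/materials"


def materials_url(identifier, data_type, prop):
    """Join preamble, identifier, data type and property into one URL.

    Hypothetical helper for illustration; not part of pymatgen or the
    Materials API itself.
    """
    return "/".join([BASE_URL, str(identifier), data_type, prop])


# The three identifier styles named on the slide: formula, id, chemical system.
formula_url = materials_url("Fe2O3", "vasp", "energy")
id_url = materials_url(1234, "vasp", "energy")
chemsys_url = materials_url("Li-Fe-O", "vasp", "energy")
print(formula_url)
```

Because every part of the URL names a piece of data, the same helper covers all three identifier styles without special cases.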
  • 23. Access via an API key •  To maintain privileged access, each user has an associated API key (with certain defined access privileges). •  To get your key, log in to materialsproject.org and go to www.materialsproject.org/profile •  All MP HTTPS requests must supply the API key as: •  An X-API-KEY header, e.g., {'X-API-KEY': 'MYKEY'}, or •  A GET or POST variable, e.g., {'API_KEY': 'MYKEY'}
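A minimal sketch of the two ways to supply the key, using only the Python standard library. The function names are hypothetical and "MYKEY" is the placeholder from the slide. Nothing is sent over the network here; the request object would still have to be passed to `urllib.request.urlopen`.

```python
from urllib.parse import urlencode
from urllib.request import Request

URL = "https://www.materialsproject.org/rest/v1/materials/Fe2O3/vasp/energy"
API_KEY = "MYKEY"  # placeholder; use the key from your profile page


def request_with_header(url, api_key):
    """Option 1: carry the key in an X-API-KEY request header."""
    return Request(url, headers={"X-API-KEY": api_key})


def url_with_key_param(url, api_key):
    """Option 2: carry the key as a GET variable named API_KEY."""
    return url + "?" + urlencode({"API_KEY": api_key})


header_request = request_with_header(URL, API_KEY)
param_url = url_with_key_param(URL, API_KEY)

# urllib stores header names in capitalized form ("X-api-key");
# HTTP header names are case-insensitive, so the server sees the same header.
print(header_request.get_header("X-api-key"))
print(param_url)
```

The header form keeps the key out of URLs that may end up in logs or browser history, which is one reason to prefer it when both are accepted.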
  • 24. Sample JSON output GET https://www.materialsproject.org/rest/v1/materials/Fe2O3/vasp/energy { "created_at": "2013-03-17T09:14:58.158081", "valid_response": true, "version": { "pymatgen": "2.5.4", "db": "2013.02.25", "rest": "1.0" }, "response": [{ "energy": -132.33005625, "material_id": 542309 }, { "energy": -66.62512425, "material_id": 24972 }], "copyright": "Copyright 2012, The Materials Project" } Just the energy and the id of the material
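As a rough illustration of consuming this output, the sketch below parses the sample response from the slide and pulls out the energy per material id. `energies_by_id` is a hypothetical helper name, and treating `valid_response` as a success flag is an assumption based on its name.

```python
import json

# The sample response shown on the slide, reproduced as a string.
SAMPLE = """
{
  "created_at": "2013-03-17T09:14:58.158081",
  "valid_response": true,
  "version": {"pymatgen": "2.5.4", "db": "2013.02.25", "rest": "1.0"},
  "response": [
    {"energy": -132.33005625, "material_id": 542309},
    {"energy": -66.62512425, "material_id": 24972}
  ],
  "copyright": "Copyright 2012, The Materials Project"
}
"""


def energies_by_id(payload):
    """Map material_id -> energy for every entry in the response list."""
    if not payload.get("valid_response"):
        # Assumption: a false flag means the query failed.
        raise ValueError("API reported an invalid response")
    return {item["material_id"]: item["energy"] for item in payload["response"]}


result = energies_by_id(json.loads(SAMPLE))
print(result)
```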
  • 25. Getting started – Hello World API > pip install flask
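A "Hello World" API in Flask might look like the sketch below. It is not the Materials Project's code: the `ENERGIES` dictionary is a made-up stand-in for a database, and the second route only imitates the URL shape from the earlier slide.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical in-memory data standing in for a real database.
ENERGIES = {"Fe2O3": [-132.33005625, -66.62512425]}


@app.route("/")
def hello():
    """The classic first endpoint."""
    return "Hello World!"


@app.route("/rest/v1/materials/<formula>/vasp/energy")
def energy(formula):
    """Return the stored energies for a formula as JSON, or a 404."""
    values = ENERGIES.get(formula)
    if values is None:
        return jsonify({"valid_response": False, "error": "unknown formula"}), 404
    return jsonify({"valid_response": True,
                    "response": [{"energy": e} for e in values]})

# To serve locally, call app.run() or use the `flask run` command.
```

The `<formula>` placeholder in the route is what turns a URL segment into a function argument, which is the core of the "URLs map to data objects" idea.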
  • 26. Our dirty little secret •  It involves a certain language that ends with “uby” that we don’t like to talk about in these parts •  Version 0.0.0 of the Materials Project was coded in Sinatra •  Sinatra is a microframework much like Flask •  But it proves that this approach is viable and can be the onramp to more amazing things.
  • 27. Un-considerations •  Don’t worry too much about pure REST •  Initially just think of how URLs and verbs can map to functions •  Don’t worry too much about data formats •  JSON is easy and a great place to start •  Feel free to avoid XML unless you really need it
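To make "URLs and verbs map to functions" concrete, here is a framework-free sketch of a routing table, with JSON as the only output format. All names (`route`, `dispatch`, the two example endpoints) are hypothetical; a real project would let Flask or Django do this work.

```python
import json
import re

ROUTES = []  # list of (verb, compiled URL pattern, handler function)


def route(verb, pattern):
    """Register a function as the handler for a verb + URL pattern."""
    def register(func):
        ROUTES.append((verb, re.compile("^" + pattern + "$"), func))
        return func
    return register


@route("GET", r"/materials/(?P<formula>[A-Za-z0-9]+)/energy")
def get_energy(formula):
    return {"formula": formula, "property": "energy"}


@route("GET", r"/materials/(?P<formula>[A-Za-z0-9]+)/structure")
def get_structure(formula):
    return {"formula": formula, "property": "structure"}


def dispatch(verb, path):
    """Find the first matching handler and return its result as JSON text."""
    for route_verb, regex, func in ROUTES:
        match = regex.match(path)
        if route_verb == verb and match:
            return json.dumps(func(**match.groupdict()))
    return json.dumps({"error": "not found"})


print(dispatch("GET", "/materials/Fe2O3/energy"))
```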
  • 28. Our Stack •  Apache + mod_wsgi •  Django •  pymatgen •  pymongo + MongoDB
  • 29. pymatgen •  The open source python library that powers the Materials Project. •  Defines core Python objects for materials data representation. •  Provides a well-tested set of structure and thermodynamic analysis tools relevant to many applications. •  Establishes an open platform for researchers to collaboratively develop sophisticated analyses of materials data obtained both from first principles calculations and experiments.
  • 30. Integration with pymatgen The Materials API Powerful Materials Analytics Tool
  • 31. Where we’re going with this •  Libraries that integrate data with computation! •  The scientific python ecosystem has a ton of data analysis tools and libraries •  Just starting to think about baking in datasets directly into these tools •  Pymatgen allows you to access core MP data directly from the library
  • 32. Compute + data •  pymatgen has hooks into the materials data so you can do stuff like this: entries = api.get_entries_in_chemsys(['Li', 'Fe', 'O']) •  But it also has computational tools that you can then use to act on the data: pd = PhaseDiagram(entries)
  • 33. Blurring the lines •  Yes – we are blurring the lines between compute and data •  But this is not a new idea •  Think of all the tools built around commercial APIs •  Twitter, Netflix etc. - python clients built around the API
  • 34. Write First Class Science Functions •  Web APIs are extremely useful, but ultimately you want to encapsulate core science functionality as Python functions so that scientists aren’t worrying about things like "How do I set the X-API-KEY header?"
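One way to picture this is a thin client class that hides the URL and header details behind a science-facing method. This sketch is not pymatgen's MPRester: `MaterialsClient`, `get_energies` and the injectable `fetch` argument are hypothetical, the fake response data is made up, and the v1 endpoint from the slides dates from 2013 and may no longer be available.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "https://www.materialsproject.org/rest/v1"  # from the slides


def _http_fetch(request):
    """Default transport: perform the HTTP request and return the body text."""
    with urlopen(request) as response:
        return response.read().decode("utf-8")


class MaterialsClient:
    """Hypothetical wrapper that hides HTTP details from the scientist."""

    def __init__(self, api_key, fetch=_http_fetch):
        self.api_key = api_key
        self._fetch = fetch  # injectable so the class can be tried offline

    def get_energies(self, formula):
        """Return the list of computed energies for a formula."""
        url = "{}/materials/{}/vasp/energy".format(BASE_URL, formula)
        request = Request(url, headers={"X-API-KEY": self.api_key})
        payload = json.loads(self._fetch(request))
        if not payload.get("valid_response"):
            raise ValueError("API reported an invalid response")
        return [item["energy"] for item in payload["response"]]


# Offline usage with a fake transport and made-up data.
seen = []


def fake_fetch(request):
    seen.append(request)
    return json.dumps({"valid_response": True,
                       "response": [{"energy": -1.5, "material_id": 1}]})


client = MaterialsClient("MYKEY", fetch=fake_fetch)
energies = client.get_energies("Fe2O3")
print(energies)
```

The caller only ever writes `client.get_energies("Fe2O3")`; the header, the URL and the JSON decoding stay inside the library.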
  • 35. Sample use cases •  Screening for CO2 sorbents (with Clare Grey) •  Using the Materials API (MAPI) + pymatgen to calculate reaction energies of thousands of oxides with CO2. •  Calculation of XAFS, XANES and other spectra for clusters of atoms (with Alan Dozier) •  Alan wrote an I/O add-on to pymatgen for FEFF input/output. •  Uses MAPI + pymatgen to extract structures. •  Defects (with Maciej Haranczyk) •  Uses MAPI + pymatgen to pull structures to perform Voronoi analysis to find possible interstitial sites.
  • 36. IPython Notebook Examples •  http://nbviewer.ipython.org/5199610 •  http://nbviewer.ipython.org/5022735
  • 37. Materials API + pymatgen example
      from pymatgen.matproj.rest import MPRester
      # Assumed import paths for 2013-era pymatgen; the slide omits them
      # and later releases moved these classes.
      from pymatgen.phasediagram.pdmaker import PhaseDiagram
      from pymatgen.phasediagram.plotter import PDPlotter

      # This initializes the REST adaptor. Put your own API key in.
      a = MPRester("YOUR_API_KEY")

      # This gives you the Structure corresponding to material id 2254
      # in the Materials Project.
      structure = a.get_structure_by_material_id(2254)

      # Entries are the basic unit for thermodynamic and other analyses
      # in pymatgen. This gets all entries belonging to the Ca-C-O system.
      entries = a.get_entries_in_chemsys(['Ca', 'C', 'O'])

      # With entries, you can do many sophisticated analyses,
      # like creating phase diagrams.
      pd = PhaseDiagram(entries)
      plotter = PDPlotter(pd)
      plotter.show()
  • 38. Sandboxes •  A virtual private dataset •  Useful for •  Everyone as a sort of "scratch" space •  Industry partners who want to use the tools but not share their data
  • 39. Import format: Structure Notation Language (SNL) •  Contains structure/molecule object, and provenance about: created_at, authors, projects, references, remarks, data, history •  Another way to remember the acronym..
  • 40. Fireworks •  FireWorks is a code for defining, managing, and executing scientific workflows •  It can be used to automate most types of calculations over arbitrary computing resources, including those that have a queueing system •  It is very dynamic: Fireworks can beget other fireworks at runtime •  http://pythonhosted.org/FireWorks/
  • 41. Pymatgen-db •  Sick of MongoHub et al.? We were. So we wrote a simple Web UI using prettytable, pymatgen, and Django •  https://github.com/materialsproject/pymatgen-db Which we proceeded to use for deep scientific inquiry
  • 42. We’re not the only ones … •  Bioinformatics •  KBase (http://kbase.us) - DOE predictive and systems biology. •  Astronomy •  Sloan Digital Sky Survey (http://skyserver.sdss.org) •  Spectroscopy •  Advanced Light Source (ALS), Advanced Photon Source (APS) •  According to ProgrammableWeb, ~130 others http://www.programmableweb.com/apis/directory/1?apicat=Science&protocol=REST ..though probably many of these are
  • 43. More information •  Materials API + pymatgen examples •  https://gist.github.com/gists/search?q=materials+api+pymatgen •  The Materials API wiki •  https://materialsproject.org/wiki/index.php/The_Materials_API •  Python Materials Genomics •  http://packages.python.org/pymatgen/ •  Shyue Ping Ong, William Davidson Richard, Anubhav Jain, Geoffroy Hautier, Michael Kocher, Shreyas Cholia, Dan Gunter, Vincent Chevrier, Kristin A. Persson, Gerbrand Ceder. Python Materials Genomics (pymatgen) : A Robust, Open-Source Python Library for Materials Analysis. (submitted) •  These slides: •  https://speakerdeck.com/shreddd/data-as-a-service-pydata-2013
  • 44. Takeaways •  Make scientific data easily available to end-users •  Friendly, powerful Web UI is a great way to engage, but then.. •  Build APIs around your data to make it easily accessible •  Write scientific libraries with *both* analysis and data, by hooking them up to APIs.
  • 45. We’re hiring •  Talented, science-loving, web-savvy, math-anything Python programming code-slingers who would rather pass a Nobel prize winner on the way to lunch than get free dry-cleaning •  downside: or even free coffee (groan) •  upside: some of your tax dollars go towards your own salary! •  http://jobs.materialsproject.org/
  • 46. Contact Us •  Shreyas Cholia – scholia@lbl.gov •  Dan Gunter – dkgunter@lbl.gov •  Materials Project Team – feedback@materialsproject.org