Reproducible Quantum Chemistry
Dr. Marcus D. Hanwell
@mhanwell
Technical Leader
American Chemical Society
Orlando, FL
31 March, 2019
What Is Open Chemistry?
● Umbrella to coordinate and group related projects
○ Focus on 3-clause BSD permissively licensed projects
○ Aims for a more complete solution
● Initially three related projects
○ Avogadro 2 - editor, visualization, interaction with small number of molecules
○ MoleQueue - running computational jobs, abstracting local and remote execution
○ MongoChem - database for interacting with many molecules, summarizing data, informatics
● Evolved over the years but still retains many of those goals
○ GitHub organization with 35 repositories at the last count
● Umbrella organization in Google Summer of Code
○ Four years, with 3, 7, 7, and TBD students over a broad range of projects
○ Hope to continue this and other community engagement activities
https://openchemistry.org/
Why Jupyter?
● Supports interactive analysis while preserving the analytic steps
○ Preserves much of the provenance
● Familiar environment and language
○ Many are already familiar with the environment
○ Python is the language of scientific computing
● Simple extension mechanism
○ Particularly with JupyterLab
○ Allows for complex domain specific visualization
● Vibrant ecosystem and community
Open Chemistry, Avogadro, Jupyter and Web
● Making data more accessible
● Federated, open data repositories
● Modern HTML5 interfaces
● JSON data format for NWChem data as a prototype, add to other QM codes
● What about working with the data?
● Can we have chemistry from desktop to phone?
○ Create data, upload, organize
○ Search and analyze data
○ Share data - email, social media, publications
● What if we tied a data server to a Jupyter notebook?
● Can we make data a first class citizen in modern workflows?
Increased Reusability
● Benefit from a huge number of open source packages/projects
● Quantum chemistry codes
○ NWChem, Psi4, ...
● Open source libraries/utilities
○ Avogadro, Open Babel, cclib, RDKit, ...
● Visualization, charting, etc
○ vtk.js, 3Dmol.js, D3, plotly, matplotlib, ...
● Web frameworks
○ React, stencil.js, npm, ...
● Languages
○ C++, Python, JavaScript, TypeScript, ...
● Containers
○ Docker, Singularity, Shifter, ...
Also version control such as git, continuous integration such as CircleCI, build systems such as CMake, project hosting such as GitHub, hardware accelerated rendering such as WebGL, queuing systems like Grid Engine, semantic data stores like Jena, format standards such as JSON, MessagePack, HDF5, XML, HTTP, RESTful web service standards, servers such as nginx, CherryPy, Flask, and many other components that are used directly or gave useful input
Increased Reusability
● Developed on GitHub under permissive OSI-approved licenses
○ Industry standard 3-clause BSD and Apache 2 mainly
● Web widgets using stencil.js to offer web tags
● Binary wheels for Python wrapped Avogadro core
○ pip install avogadro
● Pip installable Python modules for standard functions
○ pip install openchemistry
● JupyterLab extensions that can be installed locally
● Binder for “live” notebooks hosted in cloud containers
● Quantum codes and machine learning models in Docker containers
● Establishing data standards for reliable data exchange
Approach and Philosophy
● Data is the core of the platform
○ Start with a simple but powerful data model and data server
● RESTful APIs are ubiquitous
○ Use from notebooks, apps, command line, desktop, etc
● Jupyter notebooks for interactive analysis
○ High level domain specific Python API within the notebooks
● Web application
○ Authentication, access control, management tasks
○ Launching, searching, managing notebooks
○ Interact with data outside of the notebook
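As a rough sketch of the "use from notebooks, apps, command line" point: any client that can issue HTTP requests can talk to a RESTful data server. The base URL, the `molecules` path, the `inchikey` query parameter, and the `X-Api-Key` header below are hypothetical placeholders and are not taken from the Open Chemistry API. The sketch only prepares the request; it does not send it.

```python
import urllib.parse
import urllib.request

# Hypothetical server address; replace with a real deployment.
BASE_URL = "https://data.example.org/api/v1"


def build_search_request(inchikey, api_key=None):
    """Prepare (but do not send) a GET request for a molecule search."""
    query = urllib.parse.urlencode({"inchikey": inchikey})
    url = f"{BASE_URL}/molecules?{query}"  # hypothetical endpoint
    headers = {"Accept": "application/json"}
    if api_key is not None:
        headers["X-Api-Key"] = api_key  # hypothetical header name
    return urllib.request.Request(url, headers=headers, method="GET")


# An example InChIKey used purely as a search term.
request = build_search_request("RYYVLZVUVIJVGH-UHFFFAOYSA-N")
print(request.get_method(), request.full_url)
```

The same request could be issued from a notebook cell, a desktop application, or a shell script, which is what makes the REST layer a convenient common entry point.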
Reusable Web Visualization Widgets
Data, Python, Jupyter, Chemistry
Responsive Design
Getting the Platform
Containers and the Swarm
Reproducibility for Chemical-Physics Data
● Dream - share results like we can currently share code
● Links to interactive pages displaying data
● Those pages link to workflows/Jupyter notebooks
● From input geometry/molecule through to final figure
● Docker containers offer a known, reproducible binary
○ Metadata has input parameters, container ID, etc
● Aid reproducibility, machine learning, and education
● Federate access, offer full worked examples - editable!
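One way to picture "metadata has input parameters, container ID, etc" is a small record stored next to each result. The sketch below is an assumption about what such a record could look like; the class name, field names, and parameter values are hypothetical and do not describe the platform's actual schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CalculationRecord:
    """Hypothetical provenance metadata stored alongside a result."""
    code: str                      # name of the code that was run
    container_id: str              # image reference or digest used for the run
    parameters: dict = field(default_factory=dict)
    geometry_sha256: str = ""      # hash of the input geometry text

    def fingerprint(self):
        """Stable hash of the record: identical inputs give identical values."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def hash_geometry(xyz_text):
    """Hash the raw geometry text so the exact input can be verified later."""
    return hashlib.sha256(xyz_text.encode("utf-8")).hexdigest()


xyz = "1\nhelium atom\nHe 0.0 0.0 0.0\n"
record = CalculationRecord(
    code="psi4",
    container_id="openchemistry/psi4:latest",
    parameters={"theory": "hf", "basis": "sto-3g"},  # illustrative values
    geometry_sha256=hash_geometry(xyz),
)
print(record.fingerprint())
```

A mutable tag such as `latest` does not pin the binary by itself; recording an image digest in `container_id` would identify the exact image that produced a result.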
Docker Containers for Chemical-Physics
● Developed three containers so far to serve the platform
○ NWChem and Psi4 for computational chemistry
○ ChemML for machine learning
● These containers are self-contained workflow tools
○ Take JSON and input geometry
○ Use a Python-based execution script
○ Output JSON and optionally all output logs/data
● Run using Docker, Singularity, and soon Shifter; on AWS, locally, or at NERSC
● Simple contract making it easy to add more codes to the platform
○ Take some standard input, translate for your code, translate to standard output
○ Get workflow management, integration with Jupyter, visualization, ...
● The Dockerfile has build instructions, DockerHub hosts images
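The contract described above (standard input in, translate for the code, standard output out) can be sketched as a small Python entry point. The `-g`, `-p`, `-o`, and `-s` switches match the container invocations shown in this deck; everything else is an assumption, in particular the `run_code` placeholder and the `atomCount` output key, which stand in for real input generation, execution, and output parsing.

```python
import argparse
import json
import tempfile
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Sketch of a container driver")
    parser.add_argument("-g", dest="geometry", required=True, help="input geometry file")
    parser.add_argument("-p", dest="parameters", required=True, help="JSON parameters file")
    parser.add_argument("-o", dest="output", required=True, help="output JSON file")
    parser.add_argument("-s", dest="scratch", default=None, help="scratch directory")
    return parser.parse_args(argv)


def run_code(geometry_text, parameters):
    """Placeholder for: generate native input, run the code, parse its output."""
    atom_lines = geometry_text.strip().splitlines()[2:]  # skip the two XYZ header lines
    return {"parameters": parameters, "atomCount": len(atom_lines)}  # hypothetical keys


def main(argv=None):
    args = parse_args(argv)
    geometry_text = Path(args.geometry).read_text()
    parameters = json.loads(Path(args.parameters).read_text())
    if args.scratch:
        Path(args.scratch).mkdir(parents=True, exist_ok=True)
    result = run_code(geometry_text, parameters)
    Path(args.output).write_text(json.dumps(result, indent=2))
    return result


# Small demonstration using temporary files instead of a mounted /data volume.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "geometry.xyz").write_text("1\nhelium\nHe 0.0 0.0 0.0\n")
    (tmp / "parameters.json").write_text(json.dumps({"task": "energy"}))
    demo = main(["-g", str(tmp / "geometry.xyz"),
                 "-p", str(tmp / "parameters.json"),
                 "-o", str(tmp / "out.cjson")])
    print(demo)
```

Adding a new code to the platform would then mostly mean replacing `run_code` with the translation logic for that code, while the command-line surface stays the same.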
Psi4 Dockerfile
Running a Psi4 Docker Container
● Can be run independently of the framework
● docker run -v $(pwd):/data openchemistry/psi4:latest
○ -g /data/geometry.xyz
○ -p /data/parameters.json
○ -o /data/out.cjson
○ -s /data/scratch
● Runs a Python driver script that interprets switches
● Perform input/output translation, input generation, etc
● Packages a code for use in a larger workflow
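The command above can also be assembled from Python, for example inside a workflow script. This is a minimal sketch: the image name, mount point, and switches are the ones on the slide, while the helper name `docker_command` and its defaults are made up for illustration. The sketch only builds and prints the command; running it requires Docker and the image.

```python
import os
import shlex


def docker_command(image, workdir, geometry="geometry.xyz",
                   parameters="parameters.json", output="out.cjson",
                   scratch="scratch"):
    """Build the argv list for the container invocation shown above."""
    mount = "/data"  # path inside the container, as on the slide
    return [
        "docker", "run",
        "-v", f"{os.path.abspath(workdir)}:{mount}",
        image,
        "-g", f"{mount}/{geometry}",
        "-p", f"{mount}/{parameters}",
        "-o", f"{mount}/{output}",
        "-s", f"{mount}/{scratch}",
    ]


command = docker_command("openchemistry/psi4:latest", ".")
print(shlex.join(command))
# To execute it (needs Docker and the image available):
# import subprocess; subprocess.run(command, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.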
Running a NWChem Docker Container
● Can be run independently of the framework
● docker run -v $(pwd):/data openchemistry/nwchem:latest
○ -g /data/geometry.xyz
○ -p /data/parameters.json
○ -o /data/out.cjson
○ -s /data/scratch
● Runs a Python driver script that interprets switches
● Perform input/output translation, input generation, etc
● Packages a code for use in a larger workflow
Export to Binder
● Goes beyond simply showing the static notebook
● Specific GitHub repository layout
○ Install custom Python modules
○ Install JupyterLab extensions
● Service builds a container on the fly
● Can click on a link and run the example container
http://mybinder.org/v2/gh/openchemistry/jupyter-examples/master?urlpath=lab/tree/caffeine.ipynb
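The link above follows Binder's pattern for GitHub repositories: organization, repository, and branch in the path, plus a `urlpath` query parameter that opens a given notebook in JupyterLab. A minimal sketch of building such a link, where the helper name `binder_url` is made up for illustration:

```python
from urllib.parse import urlencode


def binder_url(org, repo, ref, notebook, base="https://mybinder.org"):
    """Build a Binder link that opens one notebook in JupyterLab."""
    # safe="/" keeps the slashes in the notebook path readable.
    query = urlencode({"urlpath": f"lab/tree/{notebook}"}, safe="/")
    return f"{base}/v2/gh/{org}/{repo}/{ref}?{query}"


url = binder_url("openchemistry", "jupyter-examples", "master", "caffeine.ipynb")
print(url)
```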
Export to Binder
Machine Learning
● What happens after your model is trained and published?
● Can we treat machine learning models like other codes making predictions?
● Lots of new moving parts that need to be managed
○ The actual machine learning code, possible accelerator access, etc
○ The trained model, loading it, executing it reproducibly
○ Generation of relevant descriptors as part of the input
○ Extracting output, storing, displaying, and visualizing data
● Starts to share a number of commonalities with other simulations
● Important differences too
○ Narrower focus for most models
○ Possibility to augment trained models, create derived models
Running ChemML in a Jupyter Notebook
Data Mining
● When running calculations all data, metadata, workflows are captured
● Creation of a structured data store with a friendly frontend
● Possible to run queries and analytics on the data generated
● Machine learning can feed off of this data
○ Reuse the same infrastructure to initiate and generate new data
○ Comparison of predicted data to computational codes, experimental data
○ Use of a familiar JupyterLab interface
● Augmenting the notebook with a data server that can access compute
○ Notebook acts as initiator for large jobs
○ Returning to the notebook later to check on progress
● Independent RESTful APIs, web frontend, batch export of data
Chemical JSON
● Developed to support projects (~2011)
● Stores structure, geometry, identifiers, descriptors, other useful data
● Benefits:
○ More compact than XML/CML
○ Native to MongoDB, JSON-RPC, REST
○ Easily converted to binary representation
● Now features basis sets, MOs, sets
● MessagePack a good option for binary
● Maps easily to HDF5 binary data store
● MolSSI JSON schema collaboration
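To make the format concrete, here is a small sketch of a water molecule in a Chemical JSON style layout, round-tripped through Python's `json` module. The key names (`chemicalJson`, `atoms.elements.number`, `atoms.coords.3d`, `bonds.connections.index`, `bonds.order`) reflect the flat-array layout used in Avogadro `.cjson` files as understood here and should be checked against the format documentation; the coordinates are illustrative values only.

```python
import json

# Water in a Chemical JSON style layout (key names are assumptions to verify).
water = {
    "chemicalJson": 1,
    "name": "water",
    "atoms": {
        "elements": {"number": [8, 1, 1]},
        # Flat x, y, z array, three values per atom (illustrative geometry).
        "coords": {"3d": [0.000, 0.000, 0.117,
                          0.000, 0.757, -0.469,
                          0.000, -0.757, -0.469]},
    },
    "bonds": {
        # Flat pairs of atom indices: (0, 1) and (0, 2).
        "connections": {"index": [0, 1, 0, 2]},
        "order": [1, 1],
    },
}


def atom_positions(cjson):
    """Pair each atomic number with its (x, y, z) triple."""
    flat = cjson["atoms"]["coords"]["3d"]
    numbers = cjson["atoms"]["elements"]["number"]
    return [(z, tuple(flat[3 * i:3 * i + 3])) for i, z in enumerate(numbers)]


text = json.dumps(water)       # serialize for a REST call or a document store
restored = json.loads(text)    # and read it back
print(atom_positions(restored))
```

Because the numeric data sits in flat arrays, the same structure maps naturally onto binary encodings such as MessagePack or HDF5 datasets.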
Papers and a Little History on Chemical JSON
● Quixote collaboration with Peter Murray-Rust (2011)
○ “The Quixote project: Collaborative and Open Quantum Chemistry data management in the Internet age”, https://doi.org/10.1186/1758-2946-3-38
● Early work in CML with NWChem and Avogadro (2013)
○ “From data to analysis: linking NWChem and Avogadro with the syntax and semantics of Chemical Markup Language”, https://doi.org/10.1186/1758-2946-5-25
● Later moved to JSON, RESTful API, visualization (2017)
○ “Open chemistry: RESTful web APIs, JSON, NWChem and the modern web application”, https://doi.org/10.1186/s13321-017-0241-z
● Interested in Linked Data, JSON-LD, and how they might be layered on top
● Use of BSON, HDF5, and related technologies for binary data
● BSD licensed reference implementations
Pillars of Phase II SBIR Project
1. Data and metadata
○ JSON, JSON-LD, HDF5 and semantic web
2. Server platform
○ RESTful APIs, computational chemistry, data, machine learning, HPC/cloud, and triple store
3. Jupyter integration
○ Computational chemistry, data, machine learning, query, analytics, and data visualization
4. Web application
○ Management interfaces, single-page interface, notebook/data browser, and search
5. Avogadro and local Python
○ Python shell integration, extension of Avogadro to use server interface, editing data on server
Regular automated software deployments, releases with Docker containers
Closing Thoughts
● Nearly halfway through the Phase II project
● Data and software are both central and core to the platform
● Highly reusable through licensing, modular nature, data standards, containers
● Augmented by abstracted access to compute resources
● Open source, developing entry points for customization and extension
● Building on best-of-breed open source community projects
● Extending to better support the chemistry community
○ Just at the start of making machine learning and data mining first class citizens
● User friendly interfaces, Python at the core, visualization, data analytics
● SBIR funding from DOE Office of Science contract DE-SC0017193
○ Collaborating with Bert de Jong at Berkeley Lab and Johannes Hachmann at SUNY Buffalo

Open Chemistry, JupyterLab and data: Reproducible quantum chemistry

  • 1. Reproducible Quantum Chemistry Dr. Marcus D. Hanwell @mhanwell Technical Leader American Chemical Society Orlando, FL 31 March, 2019
  • 2. What Is Open Chemistry? ● Umbrella of related projects to coordinate and group ○ Focus on 3-clause BSD permissively licensed projects ○ Aims for more complete solution ● Initially three related projects ○ Avogadro 2 - editor, visualization, interaction with small number of molecules ○ MoleQueue - running computational jobs, abstracting local and remote execution ○ MongoChem - database for interacting with many molecules, summarizing data, informatics ● Evolved over the years but still retains many of those goals ○ GitHub organization with 35 repositories at the last count ● Umbrella organization in Google Summer of Code ○ Four years, with 3, 7, 7, and TBD students over a broad range of projects ○ Hope to continue this and other community engagement activities https://openchemistry.org/
  • 3. Why Jupyter? ● Supports interactive analysis while preserving the analytic steps​ ○ Preserves much of the provenance​ ● Familiar environment and language​ ○ Many are already familiar with the environment​ ○ Python is the language of scientific computing​ ● Simple extension mechanism​ ○ Particularly with JupyterLab​ ○ Allows for complex domain specific visualization​ ● Vibrant ecosystem and community​ ​
  • 4. Open Chemistry, Avogadro, Jupyter and Web ● Making data more accessible ● Federated, open data repositories ● Modern HTML5 interfaces ● JSON data format for NWChem data as a prototype, add to other QM codes ● What about working with the data? ● Can we have chemistry from desktop-to-phone ○ Create data, upload, organize ○ Search and analyze data ○ Share data - email, social media, publications ● What if we tied a data server to a Jupyter notebook? ● Can we make data a first class citizen in modern workflows?
  • 7. Increased Reusability ● Benefit from a huge number of open source packages/projects ● Quantum chemistry codes ○ NWChem, Psi4, ... ● Open source libraries/utilities ○ Avogadro, Open Babel, cclib, RDKit, ... ● Visualization, charting, etc. ○ vtk.js, 3Dmol.js, D3, plotly, matplotlib, ... ● Web frameworks ○ React, stencil.js, npm, ... ● Languages ○ C++, Python, JavaScript, TypeScript, ... ● Containers ○ Docker, Singularity, Shifter, ... Also version control such as git, continuous integration such as CircleCI, build systems such as CMake, project hosting such as GitHub, hardware accelerated rendering such as WebGL, queuing systems like Grid Engine, semantic data stores like Jena, format standards such as JSON, MessagePack, HDF5, XML, HTTP, RESTful web service standards, servers such as nginx, CherryPy, Flask, and many other components that are used directly or gave useful input.
  • 8. Increased Reusability ● Developed on GitHub under permissive OSI-approved licenses ○ Industry standard 3-clause BSD and Apache 2 mainly ● Web widgets using stencil.js to offer web tags ● Binary wheels for Python wrapped Avogadro core ○ pip install avogadro ● Pip installable Python modules for standard functions ○ pip install openchemistry ● JupyterLab extensions that can be installed locally ● Binder for “live” notebooks hosted in cloud containers ● Quantum codes and machine learning models in Docker containers ● Establishing data standards for reliable data exchange
  • 9. Approach and Philosophy ● Data is the core of the platform ○ Start with a simple but powerful data model and data server ● RESTful APIs are ubiquitous ○ Use from notebooks, apps, command line, desktop, etc. ● Jupyter notebooks for interactive analysis ○ High level domain specific Python API within the notebooks ● Web application ○ Authentication, access control, management tasks ○ Launching, searching, managing notebooks ○ Interact with data outside of the notebook
  • 23. Reproducibility for Chemical-Physics Data ● Dream - share results like we can currently share code ● Links to interactive pages displaying data ● Those pages link to workflows/Jupyter notebooks ● From input geometry/molecule through to final figure ● Docker containers offer known, reproducible binary ○ Metadata has input parameters, container ID, etc ● Aid reproducibility, machine learning, and education ● Federate access, offer full worked examples - editable!
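The point about metadata carrying input parameters and a container ID can be illustrated with a minimal sketch of a provenance record. The field names (`imageId`, `geometrySha256`, `parametersSha256`) and the example parameter keys are hypothetical and are not taken from the Open Chemistry platform; the idea shown is only that hashing canonicalized inputs lets two runs be compared for identical inputs.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(geometry_xyz: str, parameters: dict,
                      image: str, image_id: str) -> dict:
    """Bundle the inputs and container identity of one calculation.

    All field names here are hypothetical, chosen for illustration.
    """
    # Canonical JSON (sorted keys, fixed separators) so that equal
    # parameter sets hash equally regardless of key order.
    canonical = json.dumps(parameters, sort_keys=True, separators=(",", ":"))
    return {
        "image": image,
        "imageId": image_id,
        "parameters": parameters,
        "geometrySha256": sha256_hex(geometry_xyz.encode("utf-8")),
        "parametersSha256": sha256_hex(canonical.encode("utf-8")),
    }
```

Two records with matching hashes and the same image ID describe runs that should be directly comparable, which is the property a reproducible workflow needs.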
  • 24. Docker Containers for Chemical-Physics ● Developed three containers so far to serve the platform ○ NWChem and Psi4 for computational chemistry ○ ChemML for machine learning ● These containers are self-contained workflow tools ○ Take JSON and input geometry ○ Use a Python-based execution script ○ Output JSON and optionally all output logs/data ● Run using Docker, Singularity, and soon Shifter, whether on AWS, locally, or at NERSC ● Simple contract making it easy to add more codes to the platform ○ Take some standard input, translate for your code, translate to standard output ○ Get workflow management, integration with Jupyter, visualization, ... ● The Dockerfile has build instructions, DockerHub hosts images
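The "simple contract" (standard input in, translate for the code, standard output back) might look roughly like the driver sketched below. This is not the actual Open Chemistry driver script: only the `-g`, `-p`, `-o`, and `-s` switches come from the container slides in this deck, while `read_xyz`, `run_code`, and the shape of the output JSON are hypothetical stand-ins.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None):
    """Interpret the -g/-p/-o/-s switches shown on the container slides."""
    parser = argparse.ArgumentParser(description="Hypothetical container driver")
    parser.add_argument("-g", dest="geometry", type=Path, required=True)
    parser.add_argument("-p", dest="parameters", type=Path, required=True)
    parser.add_argument("-o", dest="output", type=Path, required=True)
    parser.add_argument("-s", dest="scratch", type=Path, default=None)
    return parser.parse_args(argv)


def read_xyz(path):
    """Read an XYZ file: atom count, comment line, then 'symbol x y z' rows."""
    lines = Path(path).read_text().splitlines()
    count = int(lines[0].split()[0])
    atoms = []
    for line in lines[2:2 + count]:
        symbol, x, y, z = line.split()[:4]
        atoms.append({"symbol": symbol, "xyz": [float(x), float(y), float(z)]})
    return atoms


def run_code(atoms, parameters):
    """Placeholder for 'translate for your code, run it, translate back'."""
    return {"atomCount": len(atoms), "parameters": parameters}


def main(argv=None):
    args = parse_args(argv)
    atoms = read_xyz(args.geometry)
    parameters = json.loads(args.parameters.read_text())
    if args.scratch is not None:
        args.scratch.mkdir(parents=True, exist_ok=True)
    result = run_code(atoms, parameters)
    # Standard output of the contract: a single JSON document.
    args.output.write_text(json.dumps(result, indent=2))
    return result
```

Adding a new code to the platform would then amount to replacing `run_code` with input generation, execution, and output parsing for that code, while the switches and the JSON in/out shape stay the same.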
  • 26. Running a Psi4 Docker Container ● Can be run independently of the framework ● docker run -v $(pwd):/data openchemistry/psi4:latest ○ -g /data/geometry.xyz ○ -p /data/parameters.json ○ -o /data/out.cjson ○ -s /data/scratch ● Runs a Python driver script that interprets switches ● Perform input/output translation, input generation, etc ● Packages a code for use in a larger workflow
  • 27. Running a NWChem Docker Container ● Can be run independently of the framework ● docker run -v $(pwd):/data openchemistry/nwchem:latest ○ -g /data/geometry.xyz ○ -p /data/parameters.json ○ -o /data/out.cjson ○ -s /data/scratch ● Runs a Python driver script that interprets switches ● Perform input/output translation, input generation, etc ● Packages a code for use in a larger workflow
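Since the Psi4 and NWChem containers take the same switches, the invocation can be assembled by one helper. The sketch below only builds the argument list shown on the two slides and does not run Docker; the function name `docker_command` and the host directory `/home/user/calc` are hypothetical.

```python
import shlex


def docker_command(image, host_dir, mount="/data"):
    """Build the docker invocation from the slides as an argument list.

    host_dir is mounted at `mount` inside the container, and the four
    switches point at files below that mount point.
    """
    return [
        "docker", "run",
        "-v", f"{host_dir}:{mount}",
        image,
        "-g", f"{mount}/geometry.xyz",
        "-p", f"{mount}/parameters.json",
        "-o", f"{mount}/out.cjson",
        "-s", f"{mount}/scratch",
    ]


cmd = docker_command("openchemistry/nwchem:latest", "/home/user/calc")
print(shlex.join(cmd))
```

Swapping the image name for `openchemistry/psi4:latest` gives the Psi4 variant. An argument list like this can be passed to `subprocess.run` without shell quoting issues.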
  • 28. Export to Binder ● Goes beyond simply showing the static notebook ● Specific GitHub repository layout ○ Install custom Python modules ○ Install JupyterLab extensions ● Service builds a container on the fly ● Can click on a link and run the example container http://mybinder.org/v2/gh/openchemistry/jupyter-examples/master?urlpath=lab/tree/caffeine.ipynb
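The link on this slide follows a regular pattern: GitHub organization, repository, and branch, then a `urlpath` query that opens a notebook in JupyterLab. A small helper, with the hypothetical name `binder_lab_url` and an assumed default host of `https://mybinder.org`, could generate such links for other notebooks:

```python
from urllib.parse import quote, urlencode


def binder_lab_url(org, repo, ref, notebook, base="https://mybinder.org"):
    """Compose a Binder link that opens `notebook` in JupyterLab."""
    repo_part = "/".join(quote(part, safe="") for part in (org, repo, ref))
    # Keep "/" unescaped so the query matches the form used on the slide.
    query = urlencode({"urlpath": f"lab/tree/{notebook}"}, safe="/")
    return f"{base}/v2/gh/{repo_part}?{query}"


print(binder_lab_url("openchemistry", "jupyter-examples", "master", "caffeine.ipynb"))
```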
  • 30. Machine Learning ● What happens after your model is trained and published? ● Can we treat machine learning models like other codes making predictions? ● Lots of new moving parts that need to be managed ○ The actual machine learning code, possible accelerator access, etc. ○ The trained model, loading it, executing it reproducibly ○ Generation of relevant descriptors as part of the input ○ Extracting output, storing, displaying, and visualizing data ● Starts to share a number of commonalities with other simulations ● Important differences too ○ Narrower focus for most models ○ Possibility to augment trained models, create derived models
  • 31. Running ChemML in a Jupyter Notebook
  • 32. Data Mining ● When running calculations all data, metadata, workflows are captured ● Creation of a structured data store with a friendly frontend ● Possible to perform queries and perform analytics on the data generated ● Machine learning can feed off of this data ○ Reuse the same infrastructure to initiate and generate new data ○ Comparison of predicted data to computational codes, experimental data ○ Use of a familiar JupyterLab interface ● Augmenting the notebook with a data server that can access compute ○ Notebook acts as initiator for large jobs ○ Returning to the notebook later to check on progress ● Independent RESTful APIs, web frontend, batch export of data
  • 33. Chemical JSON ● Developed to support projects (~2011) ● Stores structure, geometry, identifiers, descriptors, other useful data ● Benefits: ○ More compact than XML/CML ○ Native to MongoDB, JSON-RPC, REST ○ Easily converted to binary representation ● Now features basis sets, MOs, sets ● MessagePack a good option for binary ● Maps easily to HDF5 binary data store ● MolSSI JSON schema collaboration
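To make the format concrete, here is a small water molecule written as a Chemical JSON style document in Python. The key names (`chemicalJson`, `atoms.elements.number`, `atoms.coords.3d`, `bonds.connections.index`, `bonds.order`) follow the layout understood to be used by Avogadro 2 and should be treated as an assumption to check against the reference implementation. The coordinates are rough illustrative values, and `atom_positions` is a hypothetical helper.

```python
import json

# Water as a Chemical JSON style document (key names are an assumption).
water = {
    "chemicalJson": 1,
    "name": "water",
    "atoms": {
        "elements": {"number": [8, 1, 1]},
        # Flat array: x, y, z for each atom in turn (illustrative values).
        "coords": {"3d": [0.0, 0.0, 0.117,
                          0.0, 0.757, -0.469,
                          0.0, -0.757, -0.469]},
    },
    "bonds": {
        # Flat array of atom index pairs: (0, 1) and (0, 2).
        "connections": {"index": [0, 1, 0, 2]},
        "order": [1, 1],
    },
}


def atom_positions(cjson):
    """Pair each atomic number with its (x, y, z) coordinate triple."""
    numbers = cjson["atoms"]["elements"]["number"]
    flat = cjson["atoms"]["coords"]["3d"]
    if len(flat) != 3 * len(numbers):
        raise ValueError("expected three coordinates per atom")
    return [(n, tuple(flat[3 * i:3 * i + 3])) for i, n in enumerate(numbers)]


# Compact serialization; the same dict could be handed to a binary encoder.
text = json.dumps(water, separators=(",", ":"))
```

Flat numeric arrays are what make the mapping to binary containers straightforward: each array can become a typed dataset rather than a tree of per-atom objects.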
  • 34. Papers and a Little History on Chemical JSON ● Quixote collaboration with Peter Murray-Rust (2011) ○ “The Quixote project: Collaborative and Open Quantum Chemistry data management in the Internet age”, https://doi.org/10.1186/1758-2946-3-38 ● Early work in CML with NWChem and Avogadro (2013) ○ “From data to analysis: linking NWChem and Avogadro with the syntax and semantics of Chemical Markup Language” https://doi.org/10.1186/1758-2946-5-25 ● Later moved to JSON, RESTful API, visualization (2017) ○ “Open chemistry: RESTful web APIs, JSON, NWChem and the modern web application” ○ https://doi.org/10.1186/s13321-017-0241-z ● Interested in Linked Data, JSON-LD, and how they might be layered on top ● Use of BSON, HDF5, and related technologies for binary data ● BSD licensed reference implementations
  • 35. Pillars of Phase II SBIR Project 1. Data and metadata ○ JSON, JSON-LD, HDF5 and semantic web 2. Server platform ○ RESTful APIs, computational chemistry, data, machine learning, HPC/cloud, and triple store 3. Jupyter integration ○ Computational chemistry, data, machine learning, query, analytics, and data visualization 4. Web application ○ Management interfaces, single-page interface, notebook/data browser, and search 5. Avogadro and local Python ○ Python shell integration, extension of Avogadro to use server interface, editing data on server Regular automated software deployments, releases with Docker containers
  • 36. Closing Thoughts ● Nearly halfway through the Phase II project ● Data and software are both central and core to the platform ● Highly reusable through licensing, modular nature, data standards, containers ● Augmented by abstracted access to compute resources ● Open source, developing entry points for customization and extension ● Building on best-of-breed open source community projects ● Extending to better support the chemistry community ○ Just at the start of making machine learning and data mining first class citizens ● User friendly interfaces, Python at the core, visualization, data analytics ● SBIR funding from DOE Office of Science contract DE-SC0017193 ○ Collaborating with Bert de Jong at Berkeley Lab and Johannes Hachmann at SUNY Buffalo