The Open Chemistry project is developing an ambitious platform to facilitate reproducible quantum chemistry workflows by integrating the best-of-breed open source projects currently available into a cohesive platform, with extensions specific to the needs of quantum chemistry. The core of the project is a Python-based data server capable of storing metadata, executing quantum chemistry calculations, and processing the output. The platform exposes language-agnostic RESTful web endpoints, and uses Linux container technology to package quantum codes that are often difficult to build.
The Jupyter project has been leveraged as a web-based frontend offering reproducibility as a core principle. This has been coupled with the data server to initiate quantum chemistry calculations, cache results, make them searchable, and even visualize the results within a modern browser environment. The Avogadro libraries have been reused for visualization workflows, coupled with Open Babel for file translation, and examples of the use of NWChem and Psi4 will be demonstrated.
The core of the platform is built upon JSON data standards, encouraging the wider adoption of JSON/HDF5 as the principal storage media. A single-page web application using React at its core will be shown for sharing simple views of data output, linking to the Jupyter notebooks that document how they were made. Command line tools and links to the Avogadro graphical interface will be shown, demonstrating capabilities from web through to desktop.
2. What Is Open Chemistry?
● Umbrella of related projects to coordinate and group
○ Focus on 3-clause BSD permissively licensed projects
○ Aims for a more complete solution
● Initially three related projects
○ Avogadro 2 - editor, visualization, interaction with small number of molecules
○ MoleQueue - running computational jobs, abstracting local and remote execution
○ MongoChem - database for interacting with many molecules, summarizing data, informatics
● Evolved over the years but still retains many of those goals
○ GitHub organization with 35 repositories at the last count
● Umbrella organization in Google Summer of Code
○ Four years, with 3, 7, 7, and TBD students over a broad range of projects
○ Hope to continue this and other community engagement activities
https://openchemistry.org/
3. Why Jupyter?
● Supports interactive analysis while preserving the analytic steps
○ Preserves much of the provenance
● Familiar environment and language
○ Many are already familiar with the environment
○ Python is the language of scientific computing
● Simple extension mechanism
○ Particularly with JupyterLab
○ Allows for complex domain specific visualization
● Vibrant ecosystem and community
4. Open Chemistry, Avogadro, Jupyter and Web
● Making data more accessible
● Federated, open data repositories
● Modern HTML5 interfaces
● JSON data format for NWChem data as a prototype, add to other QM codes
● What about working with the data?
● Can we have chemistry from desktop to phone?
○ Create data, upload, organize
○ Search and analyze data
○ Share data - email, social media, publications
● What if we tied a data server to a Jupyter notebook?
● Can we make data a first class citizen in modern workflows?
7. Increased Reusability
● Benefit from a huge number of open source packages/projects
● Quantum chemistry codes
○ NWChem, Psi4, ...
● Open source libraries/utilities
○ Avogadro, Open Babel, cclib, RDKit, ...
● Visualization, charting, etc
○ vtk.js, 3DMol.js, D3, plotly, matplotlib, ...
● Web frameworks
○ React, stencil.js, npm, ...
● Languages
○ C++, Python, JavaScript, TypeScript, ...
● Containers
○ Docker, singularity, shifter, ...
Also reused: version control such as Git, continuous integration such as CircleCI, build systems such as CMake, project hosting such as GitHub, hardware-accelerated rendering such as WebGL, queuing systems such as Grid Engine, semantic data stores such as Jena, format standards such as JSON, MessagePack, HDF5, and XML, HTTP and RESTful web service standards, servers such as nginx, CherryPy, and Flask, and many other components that are used directly or provided useful input.
8. Increased Reusability
● Developed on GitHub under permissive OSI-approved licenses
○ Industry standard 3-clause BSD and Apache 2 mainly
● Web widgets using stencil.js to offer web tags
● Binary wheels for Python wrapped Avogadro core
○ pip install avogadro
● Pip installable Python modules for standard functions
○ pip install openchemistry
● JupyterLab extensions that can be installed locally
● Binder for “live” notebooks hosted in cloud containers
● Quantum codes and machine learning models in Docker containers
● Establishing data standards for reliable data exchange
9. Approach and Philosophy
● Data is the core of the platform
○ Start with a simple but powerful data model and data server
● RESTful APIs are ubiquitous
○ Use from notebooks, apps, command line, desktop, etc
● Jupyter notebooks for interactive analysis
○ High level domain specific Python API within the notebooks
● Web application
○ Authentication, access control, management tasks
○ Launching, searching, managing notebooks
○ Interact with data outside of the notebook
23. Reproducibility for Chemical-Physics Data
● Dream - share results like we can currently share code
● Links to interactive pages displaying data
● Those pages link to workflows/Jupyter notebooks
● From input geometry/molecule through to final figure
● Docker containers offer known, reproducible binary
○ Metadata has input parameters, container ID, etc
● Aid reproducibility, machine learning, and education
● Federate access, offer full worked examples - editable!
24. Docker Containers for Chemical-Physics
● Developed three containers so far to serve the platform
○ NWChem and Psi4 for computational chemistry
○ ChemML for machine learning
● These containers are self-contained workflow tools
○ Take JSON and input geometry
○ Use a Python-based execution script
○ Output JSON and optionally all output logs/data
● Run using Docker or Singularity (soon Shifter) locally, on AWS, and at NERSC
● Simple contract making it easy to add more codes to the platform
○ Take some standard input, translate for your code, translate to standard output
○ Get workflow management, integration with Jupyter, visualization, ...
● The Dockerfile has build instructions, DockerHub hosts images
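The container contract above can be sketched as a minimal driver script. This is a hedged illustration, not the actual openchemistry driver: the switch names mirror the slides, while run_code() is a placeholder for the real NWChem/Psi4 invocation and output translation.

```python
# Sketch of the container "contract": read a geometry and JSON parameters,
# hand them to the packaged code, write Chemical JSON output.
import argparse
import json
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Container driver sketch")
    parser.add_argument("-g", dest="geometry", required=True,
                        help="input geometry (.xyz)")
    parser.add_argument("-p", dest="parameters", required=True,
                        help="calculation parameters (.json)")
    parser.add_argument("-o", dest="output", required=True,
                        help="output file (.cjson)")
    parser.add_argument("-s", dest="scratch", default="/tmp",
                        help="scratch directory")
    return parser.parse_args(argv)


def run_code(geometry_text, parameters):
    # Placeholder: translate the standard input for the packaged code,
    # run it, and translate its native output back to Chemical JSON.
    return {"chemicalJson": 1,
            "metadata": {"parameters": parameters},
            "inputGeometry": geometry_text}


def main(argv=None):
    args = parse_args(argv)
    geometry_text = Path(args.geometry).read_text()
    parameters = json.loads(Path(args.parameters).read_text())
    result = run_code(geometry_text, parameters)
    Path(args.output).write_text(json.dumps(result))
    return result
```

Keeping the switch interface identical across containers is what lets a new code join the platform by implementing only the two translation steps.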
26. Running a Psi4 Docker Container
● Can be run independently of the framework
● docker run -v $(pwd):/data openchemistry/psi4:latest
○ -g /data/geometry.xyz
○ -p /data/parameters.json
○ -o /data/out.cjson
○ -s /data/scratch
● Runs a Python driver script that interprets switches
● Performs input/output translation, input generation, etc
● Packages a code for use in a larger workflow
27. Running a NWChem Docker Container
● Can be run independently of the framework
● docker run -v $(pwd):/data openchemistry/nwchem:latest
○ -g /data/geometry.xyz
○ -p /data/parameters.json
○ -o /data/out.cjson
○ -s /data/scratch
● Runs a Python driver script that interprets switches
● Performs input/output translation, input generation, etc
● Packages a code for use in a larger workflow
28. Export to Binder
● Goes beyond simply showing the static notebook
● Specific GitHub repository layout
○ Install custom Python modules
○ Install JupyterLab extensions
● Service builds a container on the fly
● Can click on a link and run the example container
http://mybinder.org/v2/gh/openchemistry/jupyter-examples/master?urlpath=lab/tree/caffeine.ipynb
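One common repository layout Binder recognizes is sketched below. The file names follow Binder's documented conventions (binder/requirements.txt, binder/postBuild); the actual layout of the openchemistry/jupyter-examples repository may differ.

```
jupyter-examples/
├── binder/
│   ├── requirements.txt   # pip-installable modules, e.g. openchemistry
│   └── postBuild          # shell script, e.g. installing JupyterLab extensions
└── caffeine.ipynb         # the notebook launched via the link above
```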
30. Machine Learning
● What happens after your model is trained and published?
● Can we treat machine learning models like other codes making predictions?
● Lots of new moving parts that need to be managed
○ The actual machine learning code, possible accelerator access, etc
○ The trained model, loading it, executing it reproducibly
○ Generation of relevant descriptors as part of the input
○ Extracting output, storing, displaying, and visualizing data
● Starts to share a number of commonalities with other simulations
● Important differences too
○ Narrower focus for most models
○ Possibility to augment trained models, create derived models
32. Data Mining
● When running calculations all data, metadata, workflows are captured
● Creation of a structured data store with a friendly frontend
● Possible to perform queries and analytics on the data generated
● Machine learning can feed off of this data
○ Reuse the same infrastructure to initiate and generate new data
○ Comparison of predicted data to computational codes, experimental data
○ Use of a familiar JupyterLab interface
● Augmenting the notebook with a data server that can access compute
○ Notebook acts as initiator for large jobs
○ Returning to the notebook later to check on progress
● Independent RESTful APIs, web frontend, batch export of data
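A sketch of the kind of query-and-analyze step described above, using plain Python over in-memory records in place of the real data server's RESTful API. The field names (formula, theory, totalEnergy) are hypothetical illustrations, not the server's actual schema.

```python
# Toy stand-in for calculation results cached by the data server.
records = [
    {"formula": "H2O", "theory": "dft", "totalEnergy": -76.4},
    {"formula": "H2O", "theory": "hf",  "totalEnergy": -76.0},
    {"formula": "CH4", "theory": "dft", "totalEnergy": -40.5},
]


def query(records, **criteria):
    """Return the records matching every key=value criterion."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]


# Analytics over a query result, e.g. to feed a machine learning model.
dft_results = query(records, theory="dft")
energies = [r["totalEnergy"] for r in dft_results]
```

In the platform itself the same pattern would run against the RESTful API, with the notebook initiating large jobs and returning later to collect results.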
33. Chemical JSON
● Developed to support projects (~2011)
● Stores structure, geometry, identifiers,
descriptors, other useful data
● Benefits:
○ More compact than XML/CML
○ Native to MongoDB, JSON-RPC, REST
○ Easily converted to binary representation
● Now features basis sets, MOs, sets
● MessagePack a good option for binary
● Maps easily to HDF5 binary data store
● MolSSI JSON schema collaboration
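An illustrative Chemical JSON document for a hydrogen molecule, loosely following the layout of Avogadro's .cjson files; the key names here are indicative rather than a normative schema.

```python
import json

# Minimal Chemical JSON sketch: structure, geometry, and bonding.
h2 = {
    "chemicalJson": 1,
    "name": "hydrogen",
    "atoms": {
        "elements": {"number": [1, 1]},      # atomic numbers
        "coords": {"3d": [0.0, 0.0, 0.0,     # x, y, z per atom (Angstrom)
                          0.0, 0.0, 0.74]},
    },
    "bonds": {
        "connections": {"index": [0, 1]},    # pairs of atom indices
        "order": [1],                        # one bond order per connection
    },
}

# Native JSON serialization; the same tree maps readily to MessagePack
# (binary) or to groups/datasets in an HDF5 store.
serialized = json.dumps(h2)
```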
34. Papers and a Little History on Chemical JSON
● Quixote collaboration with Peter Murray-Rust (2011)
○ “The Quixote project: Collaborative and Open Quantum Chemistry data management in the
Internet age”, https://doi.org/10.1186/1758-2946-3-38
● Early work in CML with NWChem and Avogadro (2013)
○ “From data to analysis: linking NWChem and Avogadro with the syntax and semantics of
Chemical Markup Language” https://doi.org/10.1186/1758-2946-5-25
● Later moved to JSON, RESTful API, visualization (2017)
○ “Open chemistry: RESTful web APIs, JSON, NWChem and the modern web application”
○ https://doi.org/10.1186/s13321-017-0241-z
● Interested in Linked Data, JSON-LD, and how they might be layered on top
● Use of BSON, HDF5, and related technologies for binary data
● BSD licensed reference implementations
35. Pillars of Phase II SBIR Project
1. Data and metadata
○ JSON, JSON-LD, HDF5 and semantic web
2. Server platform
○ RESTful APIs, computational chemistry, data, machine learning, HPC/cloud, and triple store
3. Jupyter integration
○ Computational chemistry, data, machine learning, query, analytics, and data visualization
4. Web application
○ Management interfaces, single-page interface, notebook/data browser, and search
5. Avogadro and local Python
○ Python shell integration, extension of Avogadro to use server interface, editing data on server
Regular automated software deployments, releases with Docker containers
36. Closing Thoughts
● Nearly halfway through the Phase II project
● Data and software are both central and core to the platform
● Highly reusable through licensing, modular nature, data standards, containers
● Augmented by abstracted access to compute resources
● Open source, developing entry points for customization and extension
● Building on best-of-breed open source community projects
● Extending to better support the chemistry community
○ Just at the start of making machine learning and data mining first class citizens
● User friendly interfaces, Python at the core, visualization, data analytics
● SBIR funding from DOE Office of Science contract DE-SC0017193
○ Collaborating with Bert de Jong at Berkeley Lab and Johannes Hachmann at SUNY Buffalo