SlideShare una empresa de Scribd logo
1 de 31
Descargar para leer sin conexión
Automated Machine Learning Applied to
Diverse Materials Design Problems
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
MRS Spring Meeting, 2019
Slides (already) posted to hackingmaterials.lbl.gov
2
There are many algorithms developed for machine learning
in materials – new ones are constantly reported!
3
Q: Which one is the “best” based
on all the literature reports?
4
Q: Which one is the “best” based
on all the literature reports?
A: Can’t tell! They are (almost?)
all tested on different data sets.
• Different data sets
– Source (e.g., OQMD vs MP)
– Quantity (e.g., MP 2018 vs MP 2019)
– Subset / data filtering (e.g., ehull<X)
• Different cross validation
metrics
– e.g., what fraction is test set?
• Often, this can’t be helped
– Usually can’t access training /
test data of past works
– Sometimes no runnable version
of a published algorithm
– should referees be tougher on this?
5
Difficulty of comparing different ML algorithms
Data set used
in study A
Data set used
in study B
Data set used
in study C
• Matbench: a standard test method for materials
science problems
– A set of diverse materials data sets for testing
– A consistent cross-validation strategy
• Automatminer: A “black box” materials science
ML algorithm
– Materials-specific descriptors using matminer
– AutoML to tune hyperparameters
6
Outline
• Matbench: a standard test method for materials
science problems
– A set of diverse materials data sets for testing
– A consistent cross-validation strategy
• Automatminer: A “black box” materials science
ML algorithm
– Materials-specific descriptors using matminer
– AutoML to tune hyperparameters
7
Outline
• We want a test set that contains a diverse array
of problems
– Smaller data versus larger data
– Different applications (electronic, mechanical, etc.)
– Composition-only or structure information available
– Classification or regression
• We also want a cross-validation metric that gives
reliable error estimates
– i.e., less dependent on specific choice of splits
8
A standard test method for ML algorithms in materials
9
Overview of Matbench test set
Target Property Data Source Samples Method
Bulk Modulus Materials Project 10,987 DFT-GGA
Shear Modulus Materials Project 10,987 DFT-GGA
Band Gap Materials Project 106,113 DFT-GGA
Metallicity Materials Project 106,113 DFT-GGA
Band Gap Zhuo et al. [1] 6,354 Experiment
Metallicity Zhuo et al. [1] 6,354 Experiment
Bulk Metallic Glass formation Landolt -Bornstein 7,190 Experiment
Refractive index Materials Project 4,764 DFPT-GGA
Formation Energy Materials Project 132,752 DFT-GGA
Perovskite Formation Energy Castelli et al [2] 18,928 DFT-GGA
Freq. at Last Phonon PhDOS Peak Materials Project 1,296 DFPT-GGA
Exfoliation Energy JARVIS-2D 636 DFT-vDW-DF
Steel yield strength Citrine Informatics 312 Experiment
1. doi.org/10.1021/acs.jpclett.8b00124 2. doi.org/10.1039/C2EE22341D
<1K
1K-10K10K-100K
>100K
10
Diversity of benchmark suite
mechanical
electronic
stability
optical
thermal
classification
regression
experiment
(composition
only)
DFT
(structure)
application data size
problem
type
data type
• Matbench: a standard test method for materials
science problems
– A set of diverse materials data sets for testing
– A consistent cross-validation strategy
• Automatminer: A “black box” materials science
ML algorithm
– Materials-specific descriptors using matminer
– AutoML to tune hyperparameters
11
Outline
12
Most commonly used test procedure
• Training/validation
is used for model
selection
• Test / hold-out is
used only for error
estimation
(Test set should not
inform model
selection, i.e. “final
answer”)
Think of it as N different “universes” – we have a different
training of the model in each universe and a different hold-out.
13
Nested CV – like hold-out, but varies the hold-out set
Think of it as N different “universes” – we have a different
training of the model in each universe and a different hold-out.
14
Nested CV – like hold-out, but varies the hold-out set
“A nested CV procedure provides an almost unbiased estimate of the true error.”
Varma and Simon, Bias in error estimation when using cross-validation for model
selection (2006)
• Matbench is a curated set of data sets that provide a
diverse set of problems representative of those
found in materials science
• ML developers can work on a consistent set of test
problems
• Ideally – consistent reports of error in the literature!
• Matbench v1 will be released soon …
– Let us know if you have feedback / comments /
suggestions!
15
Summary of Matbench
• Matbench: a standard test method for materials
science problems
– A set of diverse materials data sets for testing
– A consistent cross-validation strategy
• Automatminer: A “black box” materials science
ML algorithm
– Materials-specific descriptors using matminer
– AutoML to tune hyperparameters
16
Outline
17
Typically several steps of machine learning are performed by
a human researcher – can these be automated?
Descriptors developed and
chosen by a researcher
ML model developed
and chosen by a
researcher
Why can’t we just give the computer some raw input data
(compositions, crystal structures) and output properties and get
back an ML model?
18
Automatminer is a ”black box” machine learning model
Give it any data set with either composition or structure inputs, and
automatminer will train an ML model (no researcher intervention)
19
Automatminer develops an ML model automatically given
raw data (structures or compositions plus output properties)
Featurizer
MagPie
SOAP
Sine Coulomb Matrix
+ many, many more
• Dropping
features with
many errors
• Missing value
imputation
• One-hot
encoding
• PCA-based
• Correlation
• Model-
based (tree)
Uses genetic
algorithms to find
the best machine
learning model +
hyperparameters
20
Automatminer develops an ML model automatically given
raw data (structures or compositions plus output properties)
Featurizer
MagPie
SOAP
Sine Coulomb Matrix
+ many, many more
• Dropping
features with
many errors
• Missing value
imputation
• One-hot
encoding
• PCA-based
• Correlation
• Model-
based (tree)
Uses genetic
algorithms to find
the best machine
learning model +
hyperparameters
>60 featurizer classes can
generate thousands of potential
descriptors that are described in
the literature
21
Matminer contains a library of descriptors for various
materials science entities
feat = EwaldEnergy([options])
y = feat.featurize([input_data])
• compatible with scikit-
learn pipelining
• automatically deploy
multiprocessing to
parallelize over data
• include citations to
methodology papers
22
The matminer library is available for open use
Ward et al. Matminer : An open
source toolkit for materials data
mining. Computational Materials
Science, 152, 60–69 (2018).
Paper Docs Support
hackingmaterials.github.io
/matminer
https://groups.google.com/
forum/#!forum/matminer
23
Automatminer develops an ML model automatically given
raw data (structures or compositions plus output properties)
Featurizer
MagPie
SOAP
Sine Coulomb Matrix
+ many, many more
• Dropping
features with
many errors
• Missing value
imputation
• One-hot
encoding
• PCA-based
• Correlation
• Model-
based (tree)
Uses genetic
algorithms to find
the best machine
learning model +
hyperparameters
• TPOT uses genetic algorithms to determine
the best ML model and hyperparameters
using the training / validation set
– Also some internal feature reduction, scaling,
etc. – a full pipeline of operations
• Menu of ML options is all the algorithms
implemented in scikit-learn
– i.e., not neural networks
• Parameters include population size and
number of generations for genetic algorithm
– Tradeoff between CPU time and performance
– Auto-convergence or early stop possible
24
TPOT for AutoML
Olson, R. S. & Moore, J. H. TPOT: A Tree-based Pipeline Optimization Tool for
Automating Machine Learning. in Proceedings of the Workshop on Automatic Machine
Learning (eds. Hutter, F., Kotthoff, L. & Vanschoren, J.) 64, 66–74 (PMLR, 2016).
• Comparison 1: CGCNN
• Comparison 2: MEGNET
• Comparison 3: Untuned random forest (“no frills”)
– MAGPIE features for composition
– MAGPIE + Sine Coulomb matrix for structure
25
Comparing automatminer against state-of-the-art
Xie, T. & Grossman, J. C. Phys. Rev. Lett. 120, 145301 (2018).
Chen, C., Ye, W., Zuo, Y., Zheng, C. & Ong, S. P.
arXiv:1812.05055 (2018).
Ward, L., Agrawal, A., Choudhary, A. & Wolverton, C. npj
Computational Materials 2, 16028–16028 (2016).
Ward, L., Agrawal, A., Choudhary, A. & Wolverton, C. npj
Computational Materials 2, 16028–16028 (2016).
26
Matbench results for all algorithms
27
How does data set size affect performance?
For all structure-based regression problems, divide the mean absolute
error of model by mean absolute deviation of the data set.
• Always predicting the mean would yield a value of 1.0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
100 1000 10000 100000 1000000
MAE/MAD
Data Set Size
automatminer
CGCNN
MEGNET
• Automatminer is much faster / easier to train
– One can adjust training time of all algorithms to some
extent
– Note that MEGNET is faster than CGCNN but same
order of magnitude
• GPUs might greatly accelerate CGCNN /
MEGNET training (no timing available)
28
Algorithm training time per fold on 8-16 CPU cores
Data set size Automatminer CGNN MEGNET
~1K ~1 hour or less ~few hours ~few hours
~10K ~few hours ~few days ~few days
~100K ~12 hours ~few weeks ~few weeks
29
Getting started with automatminer
Paper Docs Support
hackingmaterials.github.io
/automatminer
https://groups.google.com/
forum/#!forum/matminer
In preparation …
• We proposed a diverse benchmark test suite of
problems to develop and test ML algorithms against
• We presented a black-box ML algorithm,
Automatminer, that performs comparably or
outperforms literature values on small data sets
(N<10,000), but does more poorly on larger data sets
• Further upgrades to automatminer are in progress!
– See if we can do better on N>10,000 problems
– Although crystal networks might alternately use transfer
learning to tackle N<10,000 problems (e.g., MEGNET)
30
Conclusions
31
Acknowledgements
Alex Dunn
Graduate student
Qi Wang
Postdoc
Alex Ganose
Postdoc
Alireza Faghaninia
Postdoc
Samy Cherfaoui
Undergraduate
Daniel Dopp
Undergraduate
Funding:
U.S. Department
of Energy, Basic
Energy Sciences
Slides (already) posted to
hackingmaterials.lbl.gov

Más contenido relacionado

La actualidad más candente

Conducting and Enabling Data-Driven Research Through the Materials Project
Conducting and Enabling Data-Driven Research Through the Materials ProjectConducting and Enabling Data-Driven Research Through the Materials Project
Conducting and Enabling Data-Driven Research Through the Materials ProjectAnubhav Jain
 
Computational materials design with high-throughput and machine learning methods
Computational materials design with high-throughput and machine learning methodsComputational materials design with high-throughput and machine learning methods
Computational materials design with high-throughput and machine learning methodsAnubhav Jain
 
Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...Anubhav Jain
 
Overview of DuraMat software tool development
Overview of DuraMat software tool developmentOverview of DuraMat software tool development
Overview of DuraMat software tool developmentAnubhav Jain
 
Materials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningMaterials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningAnubhav Jain
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsAnubhav Jain
 
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...Anubhav Jain
 
Materials Project computation and database infrastructure
Materials Project computation and database infrastructureMaterials Project computation and database infrastructure
Materials Project computation and database infrastructureAnubhav Jain
 
Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...Anubhav Jain
 
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV DataThe DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV DataAnubhav Jain
 
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Anubhav Jain
 
DuraMat Data Analytics
DuraMat Data AnalyticsDuraMat Data Analytics
DuraMat Data AnalyticsAnubhav Jain
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAnubhav Jain
 
Discovering advanced materials for energy applications (with high-throughput ...
Discovering advanced materials for energy applications (with high-throughput ...Discovering advanced materials for energy applications (with high-throughput ...
Discovering advanced materials for energy applications (with high-throughput ...Anubhav Jain
 
Machine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsMachine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsAnubhav Jain
 
Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...Anubhav Jain
 
How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?Anubhav Jain
 
Smart Metrics for High Performance Material Design
Smart Metrics for High Performance Material DesignSmart Metrics for High Performance Material Design
Smart Metrics for High Performance Material Designaimsnist
 
The Materials Project: Experiences from running a million computational scien...
The Materials Project: Experiences from running a million computational scien...The Materials Project: Experiences from running a million computational scien...
The Materials Project: Experiences from running a million computational scien...Anubhav Jain
 
Prediction and Experimental Validation of New Bulk Thermoelectrics Compositio...
Prediction and Experimental Validation of New Bulk Thermoelectrics Compositio...Prediction and Experimental Validation of New Bulk Thermoelectrics Compositio...
Prediction and Experimental Validation of New Bulk Thermoelectrics Compositio...Anubhav Jain
 

La actualidad más candente (20)

Conducting and Enabling Data-Driven Research Through the Materials Project
Conducting and Enabling Data-Driven Research Through the Materials ProjectConducting and Enabling Data-Driven Research Through the Materials Project
Conducting and Enabling Data-Driven Research Through the Materials Project
 
Computational materials design with high-throughput and machine learning methods
Computational materials design with high-throughput and machine learning methodsComputational materials design with high-throughput and machine learning methods
Computational materials design with high-throughput and machine learning methods
 
Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...
 
Overview of DuraMat software tool development
Overview of DuraMat software tool developmentOverview of DuraMat software tool development
Overview of DuraMat software tool development
 
Materials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningMaterials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learning
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
 
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
 
Materials Project computation and database infrastructure
Materials Project computation and database infrastructureMaterials Project computation and database infrastructure
Materials Project computation and database infrastructure
 
Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...
 
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV DataThe DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
 
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
 
DuraMat Data Analytics
DuraMat Data AnalyticsDuraMat Data Analytics
DuraMat Data Analytics
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomate
 
Discovering advanced materials for energy applications (with high-throughput ...
Discovering advanced materials for energy applications (with high-throughput ...Discovering advanced materials for energy applications (with high-throughput ...
Discovering advanced materials for energy applications (with high-throughput ...
 
Machine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsMachine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methods
 
Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...
 
How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?
 
Smart Metrics for High Performance Material Design
Smart Metrics for High Performance Material DesignSmart Metrics for High Performance Material Design
Smart Metrics for High Performance Material Design
 
The Materials Project: Experiences from running a million computational scien...
The Materials Project: Experiences from running a million computational scien...The Materials Project: Experiences from running a million computational scien...
The Materials Project: Experiences from running a million computational scien...
 
Prediction and Experimental Validation of New Bulk Thermoelectrics Compositio...
Prediction and Experimental Validation of New Bulk Thermoelectrics Compositio...Prediction and Experimental Validation of New Bulk Thermoelectrics Compositio...
Prediction and Experimental Validation of New Bulk Thermoelectrics Compositio...
 

Similar a Automated ML for Materials Design

The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...Anubhav Jain
 
Perspectives on chemical composition and crystal structure representations fr...
Perspectives on chemical composition and crystal structure representations fr...Perspectives on chemical composition and crystal structure representations fr...
Perspectives on chemical composition and crystal structure representations fr...Anubhav Jain
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Anubhav Jain
 
Evaluating Chemical Composition and Crystal Structure Representations using t...
Evaluating Chemical Composition and Crystal Structure Representations using t...Evaluating Chemical Composition and Crystal Structure Representations using t...
Evaluating Chemical Composition and Crystal Structure Representations using t...Anubhav Jain
 
Physics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learningPhysics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learningKAMAL CHOUDHARY
 
Metabolomic Data Analysis Workshop and Tutorials (2014)
Metabolomic Data Analysis Workshop and Tutorials (2014)Metabolomic Data Analysis Workshop and Tutorials (2014)
Metabolomic Data Analysis Workshop and Tutorials (2014)Dmitry Grapov
 
Automating Machine Learning - Is it feasible?
Automating Machine Learning - Is it feasible?Automating Machine Learning - Is it feasible?
Automating Machine Learning - Is it feasible?Manuel Martín
 
Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...Anubhav Jain
 
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...Intel® Software
 
H2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupH2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupSri Ambati
 
Integrative information management for systems biology
Integrative information management for systems biologyIntegrative information management for systems biology
Integrative information management for systems biologyNeil Swainston
 
2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML modelaimsnist
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsSri Ambati
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkIvo Andreev
 
The importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsThe importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsFrancesca Lazzeri, PhD
 
Scalable Software Testing and Verification of Non-Functional Properties throu...
Scalable Software Testing and Verification of Non-Functional Properties throu...Scalable Software Testing and Verification of Non-Functional Properties throu...
Scalable Software Testing and Verification of Non-Functional Properties throu...Lionel Briand
 
AI for automated materials discovery via learning to represent, predict, gene...
AI for automated materials discovery via learning to represent, predict, gene...AI for automated materials discovery via learning to represent, predict, gene...
AI for automated materials discovery via learning to represent, predict, gene...Deakin University
 
Review of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmReview of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmIRJET Journal
 
Towards Increasing Predictability of Machine Learning Research
Towards Increasing Predictability of Machine Learning ResearchTowards Increasing Predictability of Machine Learning Research
Towards Increasing Predictability of Machine Learning ResearchArtemSunfun
 

Similar a Automated ML for Materials Design (20)

The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...
 
Perspectives on chemical composition and crystal structure representations fr...
Perspectives on chemical composition and crystal structure representations fr...Perspectives on chemical composition and crystal structure representations fr...
Perspectives on chemical composition and crystal structure representations fr...
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
 
Evaluating Chemical Composition and Crystal Structure Representations using t...
Evaluating Chemical Composition and Crystal Structure Representations using t...Evaluating Chemical Composition and Crystal Structure Representations using t...
Evaluating Chemical Composition and Crystal Structure Representations using t...
 
Physics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learningPhysics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learning
 
Metabolomic Data Analysis Workshop and Tutorials (2014)
Metabolomic Data Analysis Workshop and Tutorials (2014)Metabolomic Data Analysis Workshop and Tutorials (2014)
Metabolomic Data Analysis Workshop and Tutorials (2014)
 
Automating Machine Learning - Is it feasible?
Automating Machine Learning - Is it feasible?Automating Machine Learning - Is it feasible?
Automating Machine Learning - Is it feasible?
 
AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)
 
Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...
 
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...
 
H2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupH2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User Group
 
Integrative information management for systems biology
Integrative information management for systems biologyIntegrative information management for systems biology
Integrative information management for systems biology
 
2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
 
The importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systemsThe importance of model fairness and interpretability in AI systems
The importance of model fairness and interpretability in AI systems
 
Scalable Software Testing and Verification of Non-Functional Properties throu...
Scalable Software Testing and Verification of Non-Functional Properties throu...Scalable Software Testing and Verification of Non-Functional Properties throu...
Scalable Software Testing and Verification of Non-Functional Properties throu...
 
AI for automated materials discovery via learning to represent, predict, gene...
AI for automated materials discovery via learning to represent, predict, gene...AI for automated materials discovery via learning to represent, predict, gene...
AI for automated materials discovery via learning to represent, predict, gene...
 
Review of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmReview of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering Algorithm
 
Towards Increasing Predictability of Machine Learning Research
Towards Increasing Predictability of Machine Learning ResearchTowards Increasing Predictability of Machine Learning Research
Towards Increasing Predictability of Machine Learning Research
 

Más de Anubhav Jain

Discovering advanced materials for energy applications: theory, high-throughp...
Discovering advanced materials for energy applications: theory, high-throughp...Discovering advanced materials for energy applications: theory, high-throughp...
Discovering advanced materials for energy applications: theory, high-throughp...Anubhav Jain
 
Applications of Large Language Models in Materials Discovery and Design
Applications of Large Language Models in Materials Discovery and DesignApplications of Large Language Models in Materials Discovery and Design
Applications of Large Language Models in Materials Discovery and DesignAnubhav Jain
 
An AI-driven closed-loop facility for materials synthesis
An AI-driven closed-loop facility for materials synthesisAn AI-driven closed-loop facility for materials synthesis
An AI-driven closed-loop facility for materials synthesisAnubhav Jain
 
Best practices for DuraMat software dissemination
Best practices for DuraMat software disseminationBest practices for DuraMat software dissemination
Best practices for DuraMat software disseminationAnubhav Jain
 
Best practices for DuraMat software dissemination
Best practices for DuraMat software disseminationBest practices for DuraMat software dissemination
Best practices for DuraMat software disseminationAnubhav Jain
 
Available methods for predicting materials synthesizability using computation...
Available methods for predicting materials synthesizability using computation...Available methods for predicting materials synthesizability using computation...
Available methods for predicting materials synthesizability using computation...Anubhav Jain
 
Efficient methods for accurately calculating thermoelectric properties – elec...
Efficient methods for accurately calculating thermoelectric properties – elec...Efficient methods for accurately calculating thermoelectric properties – elec...
Efficient methods for accurately calculating thermoelectric properties – elec...Anubhav Jain
 
Natural Language Processing for Data Extraction and Synthesizability Predicti...
Natural Language Processing for Data Extraction and Synthesizability Predicti...Natural Language Processing for Data Extraction and Synthesizability Predicti...
Natural Language Processing for Data Extraction and Synthesizability Predicti...Anubhav Jain
 
Machine Learning for Catalyst Design
Machine Learning for Catalyst DesignMachine Learning for Catalyst Design
Machine Learning for Catalyst DesignAnubhav Jain
 
Natural language processing for extracting synthesis recipes and applications...
Natural language processing for extracting synthesis recipes and applications...Natural language processing for extracting synthesis recipes and applications...
Natural language processing for extracting synthesis recipes and applications...Anubhav Jain
 
Accelerating New Materials Design with Supercomputing and Machine Learning
Accelerating New Materials Design with Supercomputing and Machine LearningAccelerating New Materials Design with Supercomputing and Machine Learning
Accelerating New Materials Design with Supercomputing and Machine LearningAnubhav Jain
 
DuraMat CO1 Central Data Resource: How it started, how it’s going …
DuraMat CO1 Central Data Resource: How it started, how it’s going …DuraMat CO1 Central Data Resource: How it started, how it’s going …
DuraMat CO1 Central Data Resource: How it started, how it’s going …Anubhav Jain
 
The Materials Project
The Materials ProjectThe Materials Project
The Materials ProjectAnubhav Jain
 
Discovering and Exploring New Materials through the Materials Project
Discovering and Exploring New Materials through the Materials ProjectDiscovering and Exploring New Materials through the Materials Project
Discovering and Exploring New Materials through the Materials ProjectAnubhav Jain
 
The Materials Project: Applications to energy storage and functional materia...
The Materials Project: Applications to energy storage and functional materia...The Materials Project: Applications to energy storage and functional materia...
The Materials Project: Applications to energy storage and functional materia...Anubhav Jain
 
The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...Anubhav Jain
 
Machine Learning Platform for Catalyst Design
Machine Learning Platform for Catalyst DesignMachine Learning Platform for Catalyst Design
Machine Learning Platform for Catalyst DesignAnubhav Jain
 
Applications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignApplications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignAnubhav Jain
 
Assessing Factors Underpinning PV Degradation through Data Analysis
Assessing Factors Underpinning PV Degradation through Data AnalysisAssessing Factors Underpinning PV Degradation through Data Analysis
Assessing Factors Underpinning PV Degradation through Data AnalysisAnubhav Jain
 
Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...Anubhav Jain
 

Más de Anubhav Jain (20)

Discovering advanced materials for energy applications: theory, high-throughp...
Discovering advanced materials for energy applications: theory, high-throughp...Discovering advanced materials for energy applications: theory, high-throughp...
Discovering advanced materials for energy applications: theory, high-throughp...
 
Applications of Large Language Models in Materials Discovery and Design
Applications of Large Language Models in Materials Discovery and DesignApplications of Large Language Models in Materials Discovery and Design
Applications of Large Language Models in Materials Discovery and Design
 
An AI-driven closed-loop facility for materials synthesis
An AI-driven closed-loop facility for materials synthesisAn AI-driven closed-loop facility for materials synthesis
An AI-driven closed-loop facility for materials synthesis
 
Best practices for DuraMat software dissemination
Best practices for DuraMat software disseminationBest practices for DuraMat software dissemination
Best practices for DuraMat software dissemination
 
Best practices for DuraMat software dissemination
Best practices for DuraMat software disseminationBest practices for DuraMat software dissemination
Best practices for DuraMat software dissemination
 
Available methods for predicting materials synthesizability using computation...
Available methods for predicting materials synthesizability using computation...Available methods for predicting materials synthesizability using computation...
Available methods for predicting materials synthesizability using computation...
 
Efficient methods for accurately calculating thermoelectric properties – elec...
Efficient methods for accurately calculating thermoelectric properties – elec...Efficient methods for accurately calculating thermoelectric properties – elec...
Efficient methods for accurately calculating thermoelectric properties – elec...
 
Natural Language Processing for Data Extraction and Synthesizability Predicti...
Natural Language Processing for Data Extraction and Synthesizability Predicti...Natural Language Processing for Data Extraction and Synthesizability Predicti...
Natural Language Processing for Data Extraction and Synthesizability Predicti...
 
Machine Learning for Catalyst Design
Machine Learning for Catalyst DesignMachine Learning for Catalyst Design
Machine Learning for Catalyst Design
 
Natural language processing for extracting synthesis recipes and applications...
Natural language processing for extracting synthesis recipes and applications...Natural language processing for extracting synthesis recipes and applications...
Natural language processing for extracting synthesis recipes and applications...
 
Accelerating New Materials Design with Supercomputing and Machine Learning
Accelerating New Materials Design with Supercomputing and Machine LearningAccelerating New Materials Design with Supercomputing and Machine Learning
Accelerating New Materials Design with Supercomputing and Machine Learning
 
DuraMat CO1 Central Data Resource: How it started, how it’s going …
DuraMat CO1 Central Data Resource: How it started, how it’s going …DuraMat CO1 Central Data Resource: How it started, how it’s going …
DuraMat CO1 Central Data Resource: How it started, how it’s going …
 
The Materials Project
The Materials ProjectThe Materials Project
The Materials Project
 
Discovering and Exploring New Materials through the Materials Project
Discovering and Exploring New Materials through the Materials ProjectDiscovering and Exploring New Materials through the Materials Project
Discovering and Exploring New Materials through the Materials Project
 
The Materials Project: Applications to energy storage and functional materia...
The Materials Project: Applications to energy storage and functional materia...The Materials Project: Applications to energy storage and functional materia...
The Materials Project: Applications to energy storage and functional materia...
 
The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...
 
Machine Learning Platform for Catalyst Design
Machine Learning Platform for Catalyst DesignMachine Learning Platform for Catalyst Design
Machine Learning Platform for Catalyst Design
 
Applications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignApplications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials Design
 
Assessing Factors Underpinning PV Degradation through Data Analysis
Assessing Factors Underpinning PV Degradation through Data AnalysisAssessing Factors Underpinning PV Degradation through Data Analysis
Assessing Factors Underpinning PV Degradation through Data Analysis
 
Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...
 

Último

projectile motion, impulse and moment
projectile  motion, impulse  and  momentprojectile  motion, impulse  and  moment
projectile motion, impulse and momentdonamiaquintan2
 
Immunoblott technique for protein detection.ppt
Immunoblott technique for protein detection.pptImmunoblott technique for protein detection.ppt
Immunoblott technique for protein detection.pptAmirRaziq1
 
LESSON PLAN IN SCIENCE GRADE 4 WEEK 1 DAY 2
LESSON PLAN IN SCIENCE GRADE 4 WEEK 1 DAY 2LESSON PLAN IN SCIENCE GRADE 4 WEEK 1 DAY 2
LESSON PLAN IN SCIENCE GRADE 4 WEEK 1 DAY 2AuEnriquezLontok
 
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
6.2 Pests of Sesame_Identification_Binomics_Dr.UPRPirithiRaju
 
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdfKDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdfGABYFIORELAMALPARTID1
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxkumarsanjai28051
 
Q4-Mod-1c-Quiz-Projectile-333344444.pptx
Q4-Mod-1c-Quiz-Projectile-333344444.pptxQ4-Mod-1c-Quiz-Projectile-333344444.pptx
Q4-Mod-1c-Quiz-Projectile-333344444.pptxtuking87
 
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...Christina Parmionova
 
Science (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and PitfallsScience (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and PitfallsDobusch Leonhard
 
bonjourmadame.tumblr.com bhaskar's girls
bonjourmadame.tumblr.com bhaskar's girlsbonjourmadame.tumblr.com bhaskar's girls
bonjourmadame.tumblr.com bhaskar's girlshansessene
 
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书zdzoqco
 
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPRPirithiRaju
 
Introduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptxIntroduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptxMedical College
 
Replisome-Cohesin Interfacing A Molecular Perspective.pdf
Replisome-Cohesin Interfacing A Molecular Perspective.pdfReplisome-Cohesin Interfacing A Molecular Perspective.pdf
Replisome-Cohesin Interfacing A Molecular Perspective.pdfAtiaGohar1
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...D. B. S. College Kanpur
 
final waves properties grade 7 - third quarter
final waves properties grade 7 - third quarterfinal waves properties grade 7 - third quarter
final waves properties grade 7 - third quarterHanHyoKim
 
complex analysis best book for solving questions.pdf
complex analysis best book for solving questions.pdfcomplex analysis best book for solving questions.pdf
complex analysis best book for solving questions.pdfSubhamKumar3239
 
FBI Profiling - Forensic Psychology.pptx
FBI Profiling - Forensic Psychology.pptxFBI Profiling - Forensic Psychology.pptx
FBI Profiling - Forensic Psychology.pptxPayal Shrivastava
 
whole genome sequencing new and its types including shortgun and clone by clone
whole genome sequencing new  and its types including shortgun and clone by clonewhole genome sequencing new  and its types including shortgun and clone by clone
whole genome sequencing new and its types including shortgun and clone by clonechaudhary charan shingh university
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptxpallavirawat456
 

Último (20)

projectile motion, impulse and moment
projectile  motion, impulse  and  momentprojectile  motion, impulse  and  moment
projectile motion, impulse and moment
 
Immunoblott technique for protein detection.ppt
Immunoblott technique for protein detection.pptImmunoblott technique for protein detection.ppt
Immunoblott technique for protein detection.ppt
 
LESSON PLAN IN SCIENCE GRADE 4 WEEK 1 DAY 2
LESSON PLAN IN SCIENCE GRADE 4 WEEK 1 DAY 2LESSON PLAN IN SCIENCE GRADE 4 WEEK 1 DAY 2
LESSON PLAN IN SCIENCE GRADE 4 WEEK 1 DAY 2
 
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
6.2 Pests of Sesame_Identification_Binomics_Dr.UPR
 
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdfKDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptx
 
Q4-Mod-1c-Quiz-Projectile-333344444.pptx
Q4-Mod-1c-Quiz-Projectile-333344444.pptxQ4-Mod-1c-Quiz-Projectile-333344444.pptx
Q4-Mod-1c-Quiz-Projectile-333344444.pptx
 
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
 
Science (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and PitfallsScience (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and Pitfalls
 
bonjourmadame.tumblr.com bhaskar's girls
bonjourmadame.tumblr.com bhaskar's girlsbonjourmadame.tumblr.com bhaskar's girls
bonjourmadame.tumblr.com bhaskar's girls
 
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书
 
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
 
Introduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptxIntroduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptx
 
Replisome-Cohesin Interfacing A Molecular Perspective.pdf
Replisome-Cohesin Interfacing A Molecular Perspective.pdfReplisome-Cohesin Interfacing A Molecular Perspective.pdf
Replisome-Cohesin Interfacing A Molecular Perspective.pdf
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
 
final waves properties grade 7 - third quarter
final waves properties grade 7 - third quarterfinal waves properties grade 7 - third quarter
final waves properties grade 7 - third quarter
 
complex analysis best book for solving questions.pdf
complex analysis best book for solving questions.pdfcomplex analysis best book for solving questions.pdf
complex analysis best book for solving questions.pdf
 
FBI Profiling - Forensic Psychology.pptx
FBI Profiling - Forensic Psychology.pptxFBI Profiling - Forensic Psychology.pptx
FBI Profiling - Forensic Psychology.pptx
 
whole genome sequencing new and its types including shortgun and clone by clone
whole genome sequencing new  and its types including shortgun and clone by clonewhole genome sequencing new  and its types including shortgun and clone by clone
whole genome sequencing new and its types including shortgun and clone by clone
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptx
 

Automated ML for Materials Design

  • 1. Automated Machine Learning Applied to Diverse Materials Design Problems Anubhav Jain Energy Technologies Area Lawrence Berkeley National Laboratory Berkeley, CA MRS Spring Meeting, 2019 Slides (already) posted to hackingmaterials.lbl.gov
  • 2. 2 There are many algorithms developed for machine learning in materials – new ones are constantly reported!
  • 3. 3 Q: Which one is the “best” based on all the literature reports?
  • 4. 4 Q: Which one is the “best” based on all the literature reports? A: Can’t tell! They are (almost?) all tested on different data sets.
  • 5. • Different data sets – Source (e.g., OQMD vs MP) – Quantity (e.g., MP 2018 vs MP 2019) – Subset / data filtering (e.g., ehull<X) • Different cross validation metrics – e.g., what fraction is test set? • Often, this can’t be helped – Usually can’t access training / test data of past works – Sometimes no runnable version of a published algorithm – should referees be tougher on this? 5 Difficulty of comparing different ML algorithms Data set used in study A Data set used in study B Data set used in study C
  • 6. • Matbench: a standard test method for materials science problems – A set of diverse materials data sets for testing – A consistent cross-validation strategy • Automatminer: A “black box” materials science ML algorithm – Materials-specific descriptors using matminer – AutoML to tune hyperparameters 6 Outline
  • 7. • Matbench: a standard test method for materials science problems – A set of diverse materials data sets for testing – A consistent cross-validation strategy • Automatminer: A “black box” materials science ML algorithm – Materials-specific descriptors using matminer – AutoML to tune hyperparameters 7 Outline
  • 8. • We want a test set that contains a diverse array of problems – Smaller data versus larger data – Different applications (electronic, mechanical, etc.) – Composition-only or structure information available – Classification or regression • We also want a cross-validation metric that gives reliable error estimates – i.e., less dependent on specific choice of splits 8 A standard test method for ML algorithms in materials
  • 9. 9 Overview of Matbench test set Target Property Data Source Samples Method Bulk Modulus Materials Project 10,987 DFT-GGA Shear Modulus Materials Project 10,987 DFT-GGA Band Gap Materials Project 106,113 DFT-GGA Metallicity Materials Project 106,113 DFT-GGA Band Gap Zhuo et al. [1] 6,354 Experiment Metallicity Zhuo et al. [1] 6,354 Experiment Bulk Metallic Glass formation Landolt -Bornstein 7,190 Experiment Refractive index Materials Project 4,764 DFPT-GGA Formation Energy Materials Project 132,752 DFT-GGA Perovskite Formation Energy Castelli et al [2] 18,928 DFT-GGA Freq. at Last Phonon PhDOS Peak Materials Project 1,296 DFPT-GGA Exfoliation Energy JARVIS-2D 636 DFT-vDW-DF Steel yield strength Citrine Informatics 312 Experiment 1. doi.org/10.1021/acs.jpclett.8b00124 2. doi.org/10.1039/C2EE22341D
  • 10. <1K 1K-10K10K-100K >100K 10 Diversity of benchmark suite mechanical electronic stability optical thermal classification regression experiment (composition only) DFT (structure) application data size problem type data type
  • 11. • Matbench: a standard test method for materials science problems – A set of diverse materials data sets for testing – A consistent cross-validation strategy • Automatminer: A “black box” materials science ML algorithm – Materials-specific descriptors using matminer – AutoML to tune hyperparameters 11 Outline
  • 12. 12 Most commonly used test procedure • Training/validation is used for model selection • Test / hold-out is used only for error estimation (Test set should not inform model selection, i.e. “final answer”)
  • 13. Think of it as N different “universes” – we have a different training of the model in each universe and a different hold-out. 13 Nested CV – like hold-out, but varies the hold-out set
  • 14. Think of it as N different “universes” – we have a different training of the model in each universe and a different hold-out. 14 Nested CV – like hold-out, but varies the hold-out set “A nested CV procedure provides an almost unbiased estimate of the true error.” Varma and Simon, Bias in error estimation when using cross-validation for model selection (2006)
  • 15. • Matbench is a curated set of data sets that provide a diverse set of problems representative of those found in materials science • ML developers can work on a consistent set of test problems • Ideally – consistent reports of error in the literature! • Matbench v1 will be released soon … – Let us know if you have feedback / comments / suggestions! 15 Summary of Matbench
  • 16. • Matbench: a standard test method for materials science problems – A set of diverse materials data sets for testing – A consistent cross-validation strategy • Automatminer: A “black box” materials science ML algorithm – Materials-specific descriptors using matminer – AutoML to tune hyperparameters 16 Outline
  • 17. 17 Typically several steps of machine learning are performed by a human researcher – can these be automated? Descriptors developed and chosen by a researcher ML model developed and chosen by a researcher Why can’t we just give the computer some raw input data (compositions, crystal structures) and output properties and get back an ML model?
  • 18. 18 Automatminer is a ”black box” machine learning model Give it any data set with either composition or structure inputs, and automatminer will train an ML model (no researcher intervention)
  • 19. 19 Automatminer develops an ML model automatically given raw data (structures or compositions plus output properties) Featurizer MagPie SOAP Sine Coulomb Matrix + many, many more • Dropping features with many errors • Missing value imputation • One-hot encoding • PCA-based • Correlation • Model- based (tree) Uses genetic algorithms to find the best machine learning model + hyperparameters
  • 20. 20 Automatminer develops an ML model automatically given raw data (structures or compositions plus output properties) Featurizer MagPie SOAP Sine Coulomb Matrix + many, many more • Dropping features with many errors • Missing value imputation • One-hot encoding • PCA-based • Correlation • Model- based (tree) Uses genetic algorithms to find the best machine learning model + hyperparameters
  • 21. >60 featurizer classes can generate thousands of potential descriptors that are described in the literature 21 Matminer contains a library of descriptors for various materials science entities feat = EwaldEnergy([options]) y = feat.featurize([input_data]) • compatible with scikit- learn pipelining • automatically deploy multiprocessing to parallelize over data • include citations to methodology papers
  • 22. 22 The matminer library is available for open use Ward et al. Matminer : An open source toolkit for materials data mining. Computational Materials Science, 152, 60–69 (2018). Paper Docs Support hackingmaterials.github.io /matminer https://groups.google.com/ forum/#!forum/matminer
  • 23. 23 Automatminer develops an ML model automatically given raw data (structures or compositions plus output properties) Featurizer MagPie SOAP Sine Coulomb Matrix + many, many more • Dropping features with many errors • Missing value imputation • One-hot encoding • PCA-based • Correlation • Model- based (tree) Uses genetic algorithms to find the best machine learning model + hyperparameters
  • 24. • TPOT uses genetic algorithms to determine the best ML model and hyperparameters using the training / validation set – Also some internal feature reduction, scaling, etc. – a full pipeline of operations • Menu of ML options is all the algorithms implemented in scikit-learn – i.e., not neural networks • Parameters include population size and number of generations for genetic algorithm – Tradeoff between CPU time and performance – Auto-convergence or early stop possible 24 TPOT for AutoML Olson, R. S. & Moore, J. H. TPOT: A Tree-based Pipeline Optimization Tool for Automating Machine Learning. in Proceedings of the Workshop on Automatic Machine Learning (eds. Hutter, F., Kotthoff, L. & Vanschoren, J.) 64, 66–74 (PMLR, 2016).
  • 25. • Comparison 1: CGCNN • Comparison 2: MEGNET • Comparison 3: Untuned random forest (“no frills”) – MAGPIE features for composition – MAGPIE + Sine Coulomb matrix for structure 25 Comparing automatminer against state-of-the-art Xie, T. & Grossman, J. C. Phys. Rev. Lett. 120, 145301 (2018). Chen, C., Ye, W., Zuo, Y., Zheng, C. & Ong, S. P. arXiv:1812.05055 (2018). Ward, L., Agrawal, A., Choudhary, A. & Wolverton, C. npj Computational Materials 2, 16028–16028 (2016). Ward, L., Agrawal, A., Choudhary, A. & Wolverton, C. npj Computational Materials 2, 16028–16028 (2016).
  • 26. 26 Matbench results for all algorithms
  • 27. 27 How does data set size affect performance? For all structure-based regression problems, divide the mean absolute error of model by mean absolute deviation of the data set. • Always predicting the mean would yield a value of 1.0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 100 1000 10000 100000 1000000 MAE/MAD Data Set Size automatminer CGCNN MEGNET
  • 28. • Automatminer is much faster / easier to train – One can adjust training time of all algorithms to some extent – Note that MEGNET is faster than CGCNN but same order of magnitude • GPUs might greatly accelerate CGCNN / MEGNET training (no timing available) 28 Algorithm training time per fold on 8-16 CPU cores Data set size Automatminer CGNN MEGNET ~1K ~1 hour or less ~few hours ~few hours ~10K ~few hours ~few days ~few days ~100K ~12 hours ~few weeks ~few weeks
  • 29. 29 Getting started with automatminer Paper Docs Support hackingmaterials.github.io /automatminer https://groups.google.com/ forum/#!forum/matminer In preparation …
  • 30. • We proposed a diverse benchmark test suite of problems to develop and test ML algorithms against • We presented a black-box ML algorithm, Automatminer, that performs comparably or outperforms literature values on small data sets (N<10,000), but does more poorly on larger data sets • Further upgrades to automatminer are in progress! – See if we can do better on N>10,000 problems – Although crystal networks might alternately use transfer learning to tackle N<10,000 problems (e.g., MEGNET) 30 Conclusions
  • 31. 31 Acknowledgements Alex Dunn Graduate student Qi Wang Postdoc Alex Ganose Postdoc Alireza Faghaninia Postdoc Samy Cherfaoui Undergraduate Daniel Dopp Undergraduate Funding: U.S. Department of Energy, Basic Energy Sciences Slides (already) posted to hackingmaterials.lbl.gov