Materials design using knowledge from millions of journal articles via natural language processing techniques
1. Materials design using knowledge from millions of
journal articles via natural language processing
techniques
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
IMX Virtual Seminar, April 11 2021
Slides (already) posted to hackingmaterials.lbl.gov
2. • Often, materials are known for several decades
before their functional applications are known
– MgB2 sitting on lab shelves for 50 years before its
identification as a superconductor in 2001
– LiFePO4 known since 1938, only identified as a Li-ion
battery cathode in 1997
• Even after discovery, optimization and
commercialization still take decades
• How is this typically done?
Typically, both new materials discovery and optimization
take decades
3. What constrains traditional approaches to materials design?
“[The Chevrel] discovery resulted from a lot of
unsuccessful experiments of Mg ions insertion
into well-known hosts for Li+ ions insertion, as
well as from the thorough literature analysis
concerning the possibility of divalent ions
intercalation into inorganic materials.”
-Aurbach group, on discovery of Chevrel cathode
for multivalent (e.g., Mg2+) batteries
Levi, Levi, Chasid, Aurbach
J. Electroceramics (2009)
4. Researchers are starting to fundamentally re-think how we
invent the materials that make up our devices
Next-generation materials design:
• Computer-aided materials design
• Natural language processing
• “Self-driving laboratories”
6. 6
Can ML help us work through our backlog of information we
need to assimilate from text sources?
papers to read “someday”
NLP algorithms
7. • It is difficult to look up all information on any given material
due to the many different ways chemical compositions
are written
– a search for “TiNiSn” will give different results than “NiTiSn”
– a search for “GaSb” won’t match text that reads “Ga0.5Sb0.5”
– a search for “SnBi4Te7” won’t match text that reads “we studied
SnBi4X7 (X=S, Se, Te)”.
– a search for “AgCrSe2”, if it doesn’t have any hits, won’t suggest
“CuCrSe2” as a similar result
• It is difficult to ask questions or compile summaries, e.g.:
– What is the band gap of “Si”?
– What are all the known dopants into GaAs?
– What are all materials studied as thermoelectrics?
Traditional search doesn’t answer the questions we want
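The first two search failures above come down to string matching on formulas. A minimal sketch of one possible fix, assuming flat formulas without parentheses or placeholders such as X: parse each formula into element amounts, reduce to the smallest integer ratio, and sort the elements. The name `normalize_formula` is illustrative and is not taken from Matscholar.

```python
import re
from fractions import Fraction
from functools import reduce
from math import gcd

# An element symbol followed by an optional amount, e.g. "Ga0.5" or "Te7".
TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def normalize_formula(formula):
    """Canonical form: smallest integer ratio, elements in alphabetical order."""
    amounts = {}
    for symbol, amount in TOKEN.findall(formula):
        value = Fraction(amount) if amount else Fraction(1)
        amounts[symbol] = amounts.get(symbol, Fraction(0)) + value
    if not amounts:
        raise ValueError(f"no elements found in {formula!r}")

    # Scale fractional amounts to integers, then divide out the common factor.
    lcm = reduce(lambda a, b: a * b // gcd(a, b),
                 (v.denominator for v in amounts.values()), 1)
    integers = {s: int(v * lcm) for s, v in amounts.items()}
    divisor = reduce(gcd, integers.values())

    parts = []
    for symbol in sorted(integers):
        count = integers[symbol] // divisor
        parts.append(symbol if count == 1 else f"{symbol}{count}")
    return "".join(parts)
```

With this sketch, “TiNiSn” and “NiTiSn” map to the same key, as do “GaSb” and “Ga0.5Sb0.5”. It does not expand text like “SnBi4X7 (X=S, Se, Te)” or suggest similar compounds.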
8. What is matscholar?
• Matscholar is an attempt to organize the world’s
information on materials science, connecting
together topics of study, synthesis and
characterization methods, and specific materials
compositions
• It is also an effort to use state-of-the-art natural
language processing to make collective use of
the information in millions of articles
9. One of our main projects concerns named entity
recognition, or automatically labeling text
This allows for search
and is crucial to
downstream tasks
10. > 4 million Papers Collected
31 million Properties
19 million Materials Mentions
8.8 million Characterization Methods
7.5 million Applications
5 million Synthesis Methods
•Data Collection: Over 4 million papers
collected from more than 2100 journals.
Note – entities are currently extracted only from the abstracts of the papers
11. 11
Now we can search!
Live on www.matscholar.com
Live demo
12. • The publication data set is not complete
• Currently analyzing abstracts only
• The algorithms are not perfect
• The search interface could be improved further
• We would like to hear from you if you try this!
Limitations (it is not perfect)
13. 13
How does this work? High-level view
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
14. Extracted 4 million abstracts
of relevant scientific articles
using various APIs from
journal publishers
Some are more difficult than
others to obtain.
Data cleaning is often
needed (e.g., stray HTML
tags, copyright statements)
Abstract collection
continues …
Step 1 – data collection
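As an illustration of the data cleaning mentioned above, here is a minimal sketch that strips stray HTML tags and a trailing copyright statement from an abstract. The copyright pattern is an assumption made for this example; real publisher text varies widely and would need more rules.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")
# Assumed pattern: a copyright marker followed by a year, running to the end.
COPYRIGHT = re.compile(r"(©|\(c\)|copyright)\s*\d{4}.*$", re.IGNORECASE)
WHITESPACE = re.compile(r"\s+")


def clean_abstract(raw):
    """Remove HTML entities, tags, and a trailing copyright line."""
    text = html.unescape(raw)
    # Tags are removed without inserting a space so that "PO<sub>4</sub>"
    # becomes "PO4" rather than "PO 4".
    text = TAG.sub("", text)
    text = COPYRIGHT.sub("", text)
    return WHITESPACE.sub(" ", text).strip()
```

For example, `clean_abstract("<p>We study LiFePO<sub>4</sub> cathodes.</p> © 2019 Elsevier B.V. All rights reserved.")` keeps only the sentence about the cathodes.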
16. • First split the text into sentences
– Seems simple, but remember edge cases: the period in “et al.” or
“etc.” does not necessarily signify the end of a sentence
• Then split the sentences into words
– Tricky things are detecting and normalizing chemical
formulas, selective lowercasing (“Battery” vs “battery” or
“BaS” vs “BAs”), homogenizing numbers, etc.
• Historically done with ChemDataExtractor* with
some custom improvements
– We are moving towards a fully custom tokenizer
Step 2 - tokenization
*http://chemdataextractor.org
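The edge cases above can be sketched in a few lines. This is not the ChemDataExtractor tokenizer or the custom Matscholar one; it is a simplified illustration, and the abbreviation list, the lowercasing heuristic, and the `<num>` placeholder are all assumptions made for the example.

```python
import re

# Assumed abbreviation list; a period after these does not end a sentence.
ABBREVIATIONS = {"al.", "etc.", "e.g.", "i.e.", "Fig.", "vs."}
NUMBER = re.compile(r"[+-]?\d+(?:\.\d+)?")


def split_sentences(text):
    """Split on sentence-final punctuation, skipping known abbreviations."""
    words = text.split()
    sentences, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        is_last = i == len(words) - 1
        ends_with_stop = word.endswith((".", "?", "!"))
        next_is_capital = not is_last and words[i + 1][0].isupper()
        if is_last or (ends_with_stop and word not in ABBREVIATIONS
                       and next_is_capital):
            sentences.append(" ".join(current))
            current = []
    return sentences


def normalize_token(token):
    """Lowercase ordinary words, keep formula-like tokens, unify numbers."""
    if NUMBER.fullmatch(token):
        return "<num>"  # hypothetical placeholder for any number
    # "Battery" -> "battery", but "BaS", "BAs" and "Si" are left untouched.
    if len(token) > 2 and token.isalpha() and token[1:].islower():
        return token.lower()
    return token
```

The heuristic is deliberately crude: it cannot tell the preposition “In” from indium, which is the kind of ambiguity that motivates a dedicated tokenizer.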
18. • Part A is marking abstracts
as relevant / non-relevant
to inorganic materials
science
• Part B is tediously labeling
~600 abstracts
– Largely done by one person
– Spot-check of 25 abstracts
by a second person gave
87.4% agreement
Step 3 – hand label abstracts
20. • We use the word2vec
algorithm (Google) to turn
each unique word in our
corpus into a 200-
dimensional vector
• These vectors encode the
meaning of each word
based on trying to
predict context words
around the target
Step 4a: the word2vec algorithm is used to “featurize” words
Barazza, L. How does Word2Vec’s Skip-Gram work? Becominghuman.ai. 2017
21.
“You shall know a word by
the company it keeps”
- John Rupert Firth (1957)
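To make “predict context words around the target” concrete, here is a minimal sketch of how skip-gram training pairs are generated from a tokenized sentence. The window size is an assumption for illustration (the slides only state the 200-dimensional vector size), and the training of the network itself is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word and its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# Illustrative sentence; each pair is one training example for the model.
example = skipgram_pairs(["the", "band", "gap", "of", "GaAs"], window=1)
```

Words that keep appearing with the same context words end up with similar vectors, which is the sense in which a word is known “by the company it keeps”.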
22. • The classic example is:
– “king” - “man” + “woman” = ? → “queen”
Word embeddings trained on “normal” text learn
relationships between words
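The “king” - “man” + “woman” arithmetic can be reproduced with hand-made vectors. The three-dimensional toy vectors below are invented for illustration (real embeddings are learned, and 200-dimensional in this work); the sketch only shows the mechanics of vector arithmetic followed by a nearest-neighbour search using cosine similarity.

```python
import math

# Invented toy vectors: (royalty, male, female). Not learned embeddings.
TOY_VECTORS = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "banana": [0.1, 0.1, 0.1],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c):
    """Solve 'a - b + c = ?' by nearest cosine neighbour, excluding inputs."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))
```

Calling `analogy(TOY_VECTORS, "king", "man", "woman")` returns `"queen"` for these toy values.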
24. • If you read this sentence:
“The band gap of ___ is 4.5 eV”
It is clear that the blank should be filled in with a
material word (not a synthesis method, characterization
method, etc.)
How do we get a neural network to take into account
context (as well as properties of the word itself)?
Step 4b: How do we train a model to recognize context?
25. Step 4b. An LSTM neural net classifies words by reading
word sequences
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
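To show how an LSTM carries context from earlier words into the representation of later ones, here is a minimal forward pass of a single LSTM cell in plain Python. It is a sketch under strong assumptions: random untrained weights, no bias terms, one direction only, and no classification layer. It is not the architecture or code used in the cited work.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinyLSTM:
    """Single LSTM cell with random weights (illustration only, no biases)."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        cols = input_size + hidden_size
        # One weight matrix per gate, acting on [input ; previous hidden state].
        self.gates = {
            name: [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                   for _ in range(hidden_size)]
            for name in ("input", "forget", "output", "candidate")
        }

    def step(self, x, h, c):
        z = x + h  # list concatenation of word vector and previous state
        i = [sigmoid(v) for v in matvec(self.gates["input"], z)]
        f = [sigmoid(v) for v in matvec(self.gates["forget"], z)]
        o = [sigmoid(v) for v in matvec(self.gates["output"], z)]
        g = [math.tanh(v) for v in matvec(self.gates["candidate"], z)]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new

    def run(self, sequence):
        """Return the hidden state after each word vector in the sequence."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        states = []
        for x in sequence:
            h, c = self.step(x, h, c)
            states.append(h)
        return states
```

The hidden state for a given word depends on the words that came before it, so the same word vector yields different states in different sentences. A trained classifier on top of these states can therefore use context as well as the word itself.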
27. 27
Step 5. Let the model label things for you!
Named Entity Recognition
• Custom machine learning models to
extract the most valuable materials-related
information.
• Utilizes a long short-term memory (LSTM)
network trained on ~1000 hand-annotated
abstracts.
• F1 scores of ~0.9 overall; the F1 score for inorganic
materials extraction is >0.9.
Weston, L. et al Named Entity
Recognition and Normalization
Applied to Large-Scale
Information Extraction from
the Materials Science
Literature. J. Chem. Inf. Model.
(2019).
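For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. A minimal sketch of an entity-level calculation follows; representing entities as (position, label) pairs and the label names in the example are assumptions for illustration, not the evaluation code of the cited paper.

```python
def f1_score(predicted, gold):
    """Entity-level F1: an entity counts only if position and label match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: MAT = material, PRO = property, APL = application.
gold = [(0, "MAT"), (3, "PRO"), (5, "APL")]
predicted = [(0, "MAT"), (3, "PRO"), (6, "APL")]
score = f1_score(predicted, gold)  # 2 of 3 correct on both sides
```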
28. 28
Could these techniques also be used to predict which
materials we might want to screen for an application?
papers to read “someday”
NLP algorithms
29. • The classic example is:
– “king” - “man” + “woman” = ? → “queen”
Remember that word embeddings seem to learn
relationships in text
30. 30
For scientific text, it learns scientific concepts as well
crystal structures of the elements
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
31. 31
There seems to be materials knowledge encoded in the
word vectors
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
32. 32
Note that more data is not always better!
We want relevance
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
33. 33
Word embeddings also have the periodic table encoded in them
with no prior knowledge
“word embedding”
periodic table
Tshitoyan, V. et al. Unsupervised word embeddings capture latent
knowledge from materials science literature. Nature 571, 95–98 (2019).
34. • Dot product of a composition word with
the word “thermoelectric” essentially
predicts how likely that word is to appear
in an abstract with the word
thermoelectric
• Compositions with high dot products are
typically known thermoelectrics
• Sometimes, compositions have a high dot
product with “thermoelectric” but have
never been studied as a thermoelectric
• These compositions usually have high
computed power factors!
(DFT+BoltzTraP)
Making predictions: dot products measure likelihood for
words to co-occur
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from
materials science literature. Nature 571, 95–98 (2019).
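The ranking idea above can be sketched as follows: score every composition by its dot product with the vector for “thermoelectric”, drop those already studied for that application, and sort. The two-dimensional vectors and the `candidate_X` / `candidate_Y` names are invented for illustration; real embeddings come from the trained model.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rank_candidates(embeddings, application, already_studied):
    """Rank not-yet-studied materials by dot product with the application word."""
    target = embeddings[application]
    scores = {
        material: dot(vector, target)
        for material, vector in embeddings.items()
        if material != application and material not in already_studied
    }
    return sorted(scores, key=scores.get, reverse=True)


# Invented toy vectors; candidate_X and candidate_Y are hypothetical materials.
TOY_EMBEDDINGS = {
    "thermoelectric": [1.0, 0.2],
    "Bi2Te3": [0.9, 0.1],
    "PbTe": [0.8, 0.3],
    "candidate_X": [0.7, 0.2],
    "candidate_Y": [-0.2, 0.9],
}
ranking = rank_candidates(TOY_EMBEDDINGS, "thermoelectric", {"Bi2Te3", "PbTe"})
```

Here `candidate_X` ranks first because its vector points in nearly the same direction as “thermoelectric”, even though it is excluded from the set of materials already studied for that application.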
35. Try “going back in time” and ranking materials, and follow
what happens in later years
Tshitoyan, V. et al.
Unsupervised word
embeddings capture latent
knowledge from materials
science literature. Nature
571, 95–98 (2019).
36. – For every year since
2001, see which
compounds we would
have predicted using
only literature data until
that point in time
– Make predictions of
what materials are the
most promising
thermoelectrics using
data until that year
– See if those materials
were actually studied as
thermoelectrics in
subsequent years
A more comprehensive “back in time” test
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
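The scoring step of such a “back in time” test can be sketched as below: given the ranked predictions made with data up to a cutoff year, count how many of the top predictions were first reported as thermoelectrics in a later year. The material names and years are placeholders, and the function is an illustration rather than the analysis code from the cited paper.

```python
def fraction_later_studied(predictions, first_reported, cutoff_year, top_k):
    """Share of the top-k predictions first reported after the cutoff year."""
    top = predictions[:top_k]
    if not top:
        return 0.0
    hits = sum(
        1 for material in top
        if material in first_reported and first_reported[material] > cutoff_year
    )
    return hits / len(top)


# Placeholder data: ranked predictions from a model trained on text up to 2001,
# and the year each material was first reported as a thermoelectric (if ever).
predictions_2001 = ["m1", "m2", "m3", "m4"]
first_reported = {"m1": 2005, "m3": 2012, "m9": 1999}
rate = fraction_later_studied(predictions_2001, first_reported, 2001, top_k=4)
```

Repeating this for every cutoff year, and comparing against a baseline such as randomly chosen materials, indicates whether the ranking carries predictive signal.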
37. 37
We also published a list of potential new thermoelectrics
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
It is one thing to
retroactively test, but
perhaps another to see
how things go after
publication
38. 38
Two were studied between submission and publication of
manuscript
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
39. 39
More were studied since then (mainly computationally)
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
41. 41
More were studied since then (mainly computationally)
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
https://arxiv.org/abs/2010.08461
42. 42
Our collaborators also synthesized a prediction, finding a
moderate zT of 0.14
Tshitoyan, V. et al. Unsupervised word embeddings capture
latent knowledge from materials science literature. Nature
571, 95–98 (2019).
43. 43
How is this working?
“Context
words” link
together
information
from different
sources
45. 1. Automatic creation of structured materials databases from
the literature, e.g. doping database
Sentence | Base Material | Dopant | Doping Concentr.
…the influence of yttrium doping (0-10mol%) on BSCF… | BSCF | Yttrium | 0-10 mol%
undoped, anion-doped(Sb,Bi) and cation-doped(Ca,Zn) solid sln. of Mg10Si2Sn3… | Mg10Si2Sn3 | Sb, Bi, Ca, Zn |
The zT of As2Cd3 with electron doping is found to be ~ with n=10^20cm-3 | As2Cd3 | electron | n=10^20cm-3
This leads to zT=0.5 obtained at 500K (p=10^20cm-3) in p-type As2Cd3T | As2Cd3 | p-type | p=10^20cm-3
The undoped and 0.25wt% La doped CdO films show 111… …however, …. for doping concentrations greater than 0.50wt%. | CdO | La | 0.25wt%, >0.5%
Will allow you to answer questions like “what
are all the materials known to be doped with
Eu3+” ?
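Once the extracted rows are stored as structured records, a question like the one above becomes a simple filter. A minimal sketch, using three rows from the table on this slide; the `DopingRecord` class and the query function are hypothetical names for illustration, not the Matscholar schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DopingRecord:
    base_material: str
    dopants: tuple
    concentration: str = ""


# Rows copied from the example table; the record layout is an assumption.
RECORDS = [
    DopingRecord("BSCF", ("Yttrium",), "0-10 mol%"),
    DopingRecord("Mg10Si2Sn3", ("Sb", "Bi", "Ca", "Zn")),
    DopingRecord("CdO", ("La",), "0.25wt%, >0.5%"),
]


def materials_doped_with(records, dopant):
    """Return the sorted base materials whose dopant list contains `dopant`."""
    return sorted({r.base_material for r in records if dopant in r.dopants})
```

In practice the dopant names would also need normalizing (for example “Yttrium” versus “Y”), which is the role of the entity normalization step.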
47. 47
2. Learning representations of materials
● Mat2vec suggested that embeddings contain chemical information
● Can we make embeddings for arbitrary materials as material descriptors?
● i.e., word embeddings for materials not in the literature
● Descriptors could be used for direct classification for application (link
prediction), or quantitative property prediction (regression features)
49. 49
Initial results – predicting experimental band gap from
composition (~3000 data points)
50. 50
3. Creating a comprehensive software library for materials
science NLP research (multiple LBNL research groups)
https://github.com/lbnlp
51. 51
4. Getting data from figures
Original figure Data snippet
extracted fully
automatically
Replotted data
Can we automatically extract structured information from figures?
52. 52
4a. Detecting the various regions of the plot
(a) & (b): Human
labeling of axes and
legend regions (141
training figures)
(c) – (f): Model
predictions based on
“faster_cnn_inception”
model
53. 4b. Automatically reading the axis scales
1. Detect numbers using a custom configuration of the EasyOCR package
2. Develop algorithm to detect exponents based on size and position
a. Digit detection (previous step)
b. Compile coordinates of detected numbers
c. Separate into groups of low height and high height
3. Set tick marks at center of text height
[Figure labels: Number of entries; POWER; BASE]
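The exponent detection idea (small, raised digits are powers; taller digits are bases) can be sketched as follows. The box format, the midpoint height threshold, and the nearest-neighbour pairing are all assumptions made for this example. No EasyOCR calls are shown, and its real output format differs from these dictionaries.

```python
import math


def split_by_height(boxes):
    """Separate detected numbers into tall (base) and short (power) groups."""
    heights = [b["height"] for b in boxes]
    threshold = (min(heights) + max(heights)) / 2  # assumed simple threshold
    bases = [b for b in boxes if b["height"] >= threshold]
    powers = [b for b in boxes if b["height"] < threshold]
    return bases, powers


def tick_values(boxes):
    """Attach each power to its nearest base and return base ** power."""
    if not boxes:
        return []
    bases, powers = split_by_height(boxes)
    exponent_of = {}
    for p in powers:
        nearest = min(
            range(len(bases)),
            key=lambda i: math.dist((bases[i]["x"], bases[i]["y"]),
                                    (p["x"], p["y"])),
        )
        exponent_of[nearest] = int(p["text"])
    return [float(b["text"]) ** exponent_of.get(i, 1)
            for i, b in enumerate(bases)]


# Hypothetical detections for a log-scale y axis: x, y are text centres (px).
BOXES = [
    {"text": "10", "x": 20, "y": 100, "height": 12},
    {"text": "2", "x": 30, "y": 94, "height": 7},
    {"text": "10", "x": 20, "y": 200, "height": 12},
    {"text": "1", "x": 30, "y": 194, "height": 7},
    {"text": "10", "x": 20, "y": 300, "height": 12},
    {"text": "0", "x": 30, "y": 294, "height": 7},
]
```

Here `tick_values(BOXES)` gives one value per base label, and the `y` coordinate of each base box serves as the tick position (the center of the text height).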
54. 54
4c. Getting the data curves (color-based)
Starting image
1. Automatically detect distinct
colors using iterative k-means
clustering
2. Decide which color
channels contain
relevant data
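As an illustration of color-based separation, here is a minimal k-means clustering of RGB pixels in plain Python. It is a simplified stand-in: it uses a fixed number of clusters and a deterministic farthest-point initialization, whereas the slide describes an iterative k-means procedure whose details are not given here. The pixel values are invented.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_colors(pixels, k, iterations=20):
    """Cluster RGB tuples into k colors; returns the cluster centers."""
    # Farthest-point initialization: deterministic, spreads the initial centers.
    centers = [pixels[0]]
    while len(centers) < k:
        centers.append(max(
            pixels, key=lambda p: min(squared_distance(p, c) for c in centers)))

    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in pixels:
            nearest = min(range(k),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(channel) / len(members) for channel in zip(*members))
            if members else center
            for center, members in zip(centers, clusters)
        ]
        if new_centers == centers:
            break  # assignments no longer change
        centers = new_centers
    return centers


# Invented pixels: white background plus a reddish and a bluish curve.
PIXELS = ([(255, 255, 255)] * 6
          + [(250, 10, 10), (240, 0, 5)]
          + [(10, 10, 250), (0, 5, 240)])
centers = kmeans_colors(PIXELS, k=3)
```

Each center is a candidate curve color; a follow-up step would still have to decide which clusters are data and which are background, axes, or text.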
56. • There is a lot of data and knowledge in the
historical corpus of scientific journal articles, but
getting the knowledge has been difficult to do on
a large scale
• Machine learning presents a new frontier for
being able to make use of this information
Conclusion
57. 57
The Matscholar team
Kristin Persson
Anubhav Jain
Gerbrand Ceder
John Dagdelen
Leigh Weston
Vahe Tshitoyan
Amalie Trewartha
Alex Dunn
Viktoriia Baibakova
Funding from
(now at Google) (now at Medium)
Slides (already) posted to
hackingmaterials.lbl.gov
+ DOE ARPA-e (figure extraction)