Chemxseer qr-sagnik
1. Search Engine and Repository for eChemistry
C. Lee Giles, Prasenjit Mitra, Karl Mueller, Levent Bolelli, Xiaonan Lu, Saurabh
Kataria, Ying Liu, Anuj Jaiswal, Kun Bai, Bingjun Sun, Isaac Councill, James Z. Wang,
James Kubicki, Barbara Garrison, William Brouwer, Joel Bandstra, Qingzhao Tan,
Juan Pablo Ramirez Fernandez, Madian Khabsa, Hung-Hsuan Chen, Sagnik Ray
Choudhury
Chemistry, Computer Sciences and Engineering, Geosciences, Information Sciences
and Technology
Pennsylvania State University, University Park, PA, USA
Past funding: NSF Cyberinfrastructure Chemistry, Microsoft
Current Support: Dow Chemical
http://chemxseer.ist.psu.edu
2. Talk Overview
● Challenges and Motivation.
● Functionalities
– Fulltext Search
– Author Search
– Table Search
– Figure Search
– Expertise Search
– Chemical Name and Formula Tagging
– Chemical Name and Formula Search
● Summary.
9. Sample Table Metadata Extracted File
<Table>
<DocumentOrigin>Analyst</DocumentOrigin>
<DocumentName>b006011i.pdf</DocumentName>
<Year>2001</Year>
<DocumentTitle>Detection of chlorinated methanes by tin oxide
gas sensors </DocumentTitle>
<Author>Sang Hyun Park, a ? Young-Chan Son, a Brenda R .
Shaw, a Kenneth E. Creasy,* b and Steven L. Suib* acd a Department of Chemistry, U-60, University
of Connecticut, Storrs, C T 06269
3060</Author>
<TheNumOfCiters></TheNumOfCiters>
<Citers></Citers>
<TableCaption>Table 1 Temperature effect o n r esistance chan
ge ( D R ) and response timeof tin
oxide thin film with 1 % C Cl 4</TableCaption>
<TableColumnHeading>D R Temperature/ ° C D R a / W ( R ,O
2 ) (%) R esponse time Reproducibiliy
</TableColumnHeading>
<TableContent>100 223 5 ~ 22 min Yes 200 270 9 ~ 7-8 min Yes
300 1027 21 < 2 0 s Yes 400 993 31 ~ 1
0 s No </TableContent>
<TableFootnote> a D R =( R , CCl 4 ) - ( R ,O 2 ). </TableFootnote>
<ColumnNum>5</ColumnNum>
<TableReferenceText>In page 3, line 11, … Film responses to
1% CCl4 at different temperatures are
summarized in Table 1……</TableReferenceText>
<PageNumOfTable>3</PageNumOfTable>
<Snapshot>b006011i/b006011i_t1.jpg</Snapshot>
</Table>
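A minimal sketch of how a record like the one above could be loaded and matched against a caption query. The shortened record, the escaping of "<" as "&lt;", and the helper names load_table_record and caption_matches are assumptions made for illustration; this is not the ChemXSeer table search implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened record that reuses element names from the sample.
# A literal "<" in cell text is escaped as "&lt;" so the record parses as XML.
RECORD = """<Table>
<DocumentName>b006011i.pdf</DocumentName>
<Year>2001</Year>
<TableCaption>Table 1 Temperature effect on resistance change</TableCaption>
<TableContent>300 1027 21 &lt; 20 s Yes</TableContent>
<ColumnNum>5</ColumnNum>
<PageNumOfTable>3</PageNumOfTable>
</Table>"""


def load_table_record(xml_text):
    """Return a dict mapping each child element name to its stripped text."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def caption_matches(record, query):
    """Case-insensitive check that every query word occurs in the caption."""
    caption = record.get("TableCaption", "").lower()
    return all(word in caption for word in query.lower().split())


record = load_table_record(RECORD)
print(record["DocumentName"], record["PageNumOfTable"])
print(caption_matches(record, "temperature resistance"))
```

Keeping each metadata field as a separate key would let a search layer weight caption, column heading and reference text differently, although the slides do not say how the fields are actually ranked.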
11. ChemXSeer Figure/Plot Data Extraction and Search
Numerical data in scientific publications are often found in figures.
No search engine allows searching on figures and their data in chemical documents.
Tools that automate data extraction from figures and allow search on them can:
• Increase our understanding of the key concepts of papers.
• Provide data for automatic comparative analyses.
• Enable regeneration of figures in different contexts.
• Enable search for documents with figures containing specific experiment results.
X. Lu et al., JCDL 2006; Ray Choudhury et al., JCDL 2013, ICDAR 2013
13. ChemXSeer Name and Formula Extraction and Search
• Extraction and search of chemical names and formulae in
scientific documents has been shown to be very useful.
• Extraction and search on chemical names is hard:
– Many new chemical molecules are created every day, so any dictionary-based
name recognizer will eventually fail.
– Names need to be segmented to get semantically meaningful sub-terms
such as “methyl”, “ethyl” and “alcohol” from “methylethyl alcohol”.
• Identifying formulae is hard:
• “… YSI 5301, Yellow Springs, OH, USA …” (non-formula)
• “… such as hydroxyl radical OH, superoxide O2- …” (formula)
• For searching, formulae cannot be treated as text.
• Domain knowledge (formula identification)
• Structural knowledge (substructure finding and search)
B. Sun et al., WWW 2007, WWW 2008, TOIS
14. Chemical Entity Extraction and Tagging
● Name tagging
– Each chemical name can be a phrase
– Example
● "... Determination of lactic acid and ..."
● "... insecticide promecarb (3-isopropyl-5-methylphenyl
methylcarbamate) acts against ..."
● Formula tagging
– Each formula is a single term
– Example
● "... such as hydroxyl radical OH, superoxide ..."
– Non-formula example
● "... YSI 5301, Yellow Springs, OH, USA ..."
● Tagging examples
– Name tagging:
"... of <name-type>lactic acid</name-type> and ..."
– Formula tagging:
"... radical <formula-type>OH</formula-type>, superoxide ..."
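To illustrate why formula tagging needs more than pattern matching, here is a deliberately naive sketch. The tiny ELEMENTS list, the regular expression and the tag_formulas helper are assumptions made for this example and are not the ChemXSeer tagger. The sketch wraps every token that looks like a sequence of element symbols, so it tags "OH" in both example sentences, including the one where it stands for Ohio.

```python
import re

# Hypothetical, deliberately tiny element list; a real tagger would need
# the full periodic table plus contextual features.
ELEMENTS = ["Cl", "C", "H", "O", "N", "S"]

# A whole token made of element symbols, each with an optional count,
# e.g. "OH", "CCl4", "O2".
FORMULA_RE = re.compile(r"\b(?:(?:%s)\d*)+\b" % "|".join(ELEMENTS))


def tag_formulas(text):
    """Wrap every formula-shaped token in <formula-type> tags (naive)."""
    return FORMULA_RE.sub(
        lambda m: "<formula-type>%s</formula-type>" % m.group(0), text
    )


chem = tag_formulas("such as hydroxyl radical OH, superoxide")
addr = tag_formulas("YSI 5301, Yellow Springs, OH, USA")
print(chem)  # correct tag
print(addr)  # false positive: "OH" here is the state of Ohio
```

Both sentences end up with a tagged "OH", which is the ambiguity the slide points to: the surface string alone does not decide the label, so the surrounding words have to be taken into account.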
15. Online Chemical Entity Tagger
● We have an open-source chemical name and formula
tagger and a web-based interface for evaluation.
● The interface takes a PDF file as input and returns the text of
the PDF with names or formulas tagged.
16. Online Chemical Entity Tagger: Chemical Name Tagging Example
● Results on a sample PDF.
● Some chemical formulas are erroneously identified as chemical
names (loss of precision).
● High recall (most chemical names are identified).
17. Online Chemical Entity Tagger: Chemical Formula Tagging Example
● Results on a sample PDF.
● Some chemical formulas are not identified (loss of recall).
● High precision (words identified as formulas are actual formulas).
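The precision and recall trade-off described on these two slides can be made concrete with a short sketch. The gold and predicted sets below are invented for illustration and are not results from the ChemXSeer tagger; precision_recall is a hypothetical helper name.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of tagged strings."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical hand-labelled formulas and tagger output for one document.
gold = {"OH", "O2", "CCl4", "H2O"}
predicted = {"OH", "CCl4"}

p, r = precision_recall(predicted, gold)
print(p, r)
```

In this made-up case everything the tagger marks is a real formula, but half of the formulas are missed, which is the pattern described for formula tagging.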
18. Chemical Name Indexing and Search
• Index Schemes:
– Which tokens to index?
– Indexing all subsequences generates a large index.
– “but” is a morpheme in “butane”, but not in “nembutal”.
● Segmentation-based index scheme
– Used for indexing chemical names
– First segment a chemical name hierarchically, then index the
substring at each node if it is frequent.
– acetaldoxime -> aldoxime -> oxime.
– A search for oxime returns all of these, ordered by the ranking function.
– This cannot be done in ordinary text search.
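The segmentation-based scheme above can be sketched as a small inverted index. The segmentations are supplied by hand here and are hypothetical; how ChemXSeer actually segments names, and its frequency threshold and ranking function, are not given on the slide. build_index, search and min_freq are names chosen for this example.

```python
from collections import defaultdict

# Hypothetical hierarchical segmentations: each name maps to the sub-terms
# found at the nodes of its segmentation tree (the name itself included).
SEGMENTATIONS = {
    "acetaldoxime": ["acetaldoxime", "aldoxime", "oxime"],
    "aldoxime": ["aldoxime", "oxime"],
    "oxime": ["oxime"],
    "butane": ["butane", "but", "ane"],
    "nembutal": ["nembutal"],  # "but" is not a segment of this name
}


def build_index(segmentations, min_freq=1):
    """Map each sufficiently frequent sub-term to the names containing it."""
    counts = defaultdict(int)
    for terms in segmentations.values():
        for term in set(terms):
            counts[term] += 1
    index = defaultdict(set)
    for name, terms in segmentations.items():
        for term in terms:
            if counts[term] >= min_freq:
                index[term].add(name)
    return index


def search(index, term):
    """Return the names indexed under a sub-term, in alphabetical order."""
    return sorted(index.get(term, ()))


index = build_index(SEGMENTATIONS)
print(search(index, "oxime"))
print(search(index, "but"))
```

Because only segmentation nodes are indexed, a query for "but" reaches "butane" but not "nembutal", and the index stays far smaller than one built from every substring. Alphabetical order stands in for the ranking function here.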
20. Expert Recommendation - CiteSeerX
http://seerseer.ist.psu.edu (new version CSSeers)
Built on top of millions of papers in CiteSeerX.
A similar system was developed for Dow Chemical.
Can find experts in “polymer chemistry” or the expertise of “Linus Pauling”.
Finds an expert based on their publications.
Many approaches:
– Keyphrases
– Citations
– Download count
– Affiliation
Treeratpituk, Chen, JCDL’13
21. Future Work
Lots of interesting work to do! Few computer/machine learning scientists involved.
• Acquisitions - more documents, data, knowledge
• Chemical 3D graph search
• Fundamental chemical graph representation analysis
• Table data storage and access
• Figure search and data extraction and access
• New data and feature search
– spectra, experimental methods, instrumentation
• New documents: 400K PubMed
• Semantic chemical graphs
• Expert/collaborator search
• Search integration of all features
The first data mining task is to detect chemical names and formulas from the literature.
So the task of entity tagging is to find the hidden labels of each term in the text
Most of those substrings on the tree are semantically meaningful.