Search Engine and Repository for eChemistry
C. Lee Giles, Prasenjit Mitra, Karl Mueller, Levent Bolelli, Xiaonan Lu, Saurabh
Kataria, Ying Liu, Anuj Jaiswal, Kun Bai, Bingjun Sun, Isaac Councill, James Z. Wang,
James Kubicki, Barbara Garrison, William Brouwer, Joel Bandstra, Qingzhao Tan,
Juan Pablo Ramirez Fernandez, Madian Khabsa, Hung-Hsuan Chen, Sagnik Ray
Choudhury
Chemistry, Computer Sciences and Engineering, Geosciences, Information Sciences
and Technology
Pennsylvania State University, University Park, PA, USA

Past funding: NSF Cyberinfrastructure Chemistry, Microsoft
Current Support: Dow Chemical

http://chemxseer.ist.psu.edu
Talk Overview
● Challenges and Motivation.
● Functionalities
– Fulltext Search
– Author Search
– Table Search
– Figure Search
– Expertise Search
– Chemical Name and Formula Tagging
– Chemical Name and Formula Search
● Summary.
Based on cyberinfrastructure
for CiteSeerX
Built on Solr/Lucene,
SeerSuite, other OSS
ChemXSeer RSC
ChemXSeer Fulltext Search
ChemXSeer Author Search
ChemXSeer Table Search
• Tables are widely used to present experimental results or
statistical data in scientific documents.
• Existing search engines treat tabular data as regular text
– Structural information and semantics not preserved.
– We automatically identify tables and extract table metadata in XML.
Table Metadata Representation:
• Environment metadata: (document specifics: type, title,…)
• Frame metadata: (border left, right, top, bottom, …)
• Affiliated metadata: (Caption, footnote, …)
• Layout metadata: (number of rows, columns, headers,…)
• Cell content metadata: (values in cells)
• Type metadata: (numeric, symbolic, hybrid, …)

Y. Liu et al., AAAI 2007, JCDL 2007.
Sample Table Metadata Extracted File
<Table>
<DocumentOrigin>Analyst</DocumentOrigin>
<DocumentName>b006011i.pdf</DocumentName>
<Year>2001</Year>
<DocumentTitle>Detection of chlorinated methanes by tin oxide gas sensors </DocumentTitle>
<Author>Sang Hyun Park, a ? Young-Chan Son, a Brenda R . Shaw, a Kenneth E. Creasy,* b and Steven L. Suib* acd a Department of Chemistry, U-60, University of Connecticut, Storrs, C T 06269 3060</Author>
<TheNumOfCiters></TheNumOfCiters>
<Citers></Citers>
<TableCaption>Table 1 Temperature effect o n r esistance chan ge ( D R ) and response timeof tin oxide thin film with 1 % C Cl 4</TableCaption>
<TableColumnHeading>D R Temperature/ ¡ã C D R a / W ( R ,O 2 ) (%) R esponse time Reproducibiliy </TableColumnHeading>
<TableContent>100 223 5 ~ 22 min Yes 200 270 9 ~ 7-8 min Yes 300 1027 21 < 2 0 s Yes 400 993 31 ~ 1 0 s No </TableContent>
<TableFootnote> a D R =( R , CCl 4 ) - ( R ,O 2 ). </TableFootnote>
<ColumnNum>5</ColumnNum>
<TableReferenceText>In page 3, line 11, … Film responses to 1% CCl4 at different temperatures are summarized in Table 1……</TableReferenceText>
<PageNumOfTable>3</PageNumOfTable>
<Snapshot>b006011i/b006011i_t1.jpg</Snapshot>
</Table>
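A record like the sample above could be loaded for search or analysis along the following lines. This is only a minimal sketch: the element names are taken from the sample, but the record text here is a shortened, hand-cleaned stand-in (an assumption), with the "<" in the cell text escaped so that it is well-formed XML, and `parse_table_record` is a hypothetical helper, not part of ChemXSeer.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-cleaned record reusing element names from the sample.
# The "<" in the cell text is escaped so the record is well-formed XML.
RECORD = """<Table>
<DocumentOrigin>Analyst</DocumentOrigin>
<DocumentName>b006011i.pdf</DocumentName>
<Year>2001</Year>
<TableContent>100 223 5 ~22 min Yes 200 270 9 ~7-8 min Yes 300 1027 21 &lt;20 s Yes 400 993 31 ~10 s No</TableContent>
<ColumnNum>5</ColumnNum>
<PageNumOfTable>3</PageNumOfTable>
</Table>"""


def parse_table_record(xml_text):
    """Return a dict mapping each child element name to its stripped text."""
    root = ET.fromstring(xml_text)
    record = {child.tag: (child.text or "").strip() for child in root}
    # Convert the purely numeric fields when they are present and non-empty.
    for key in ("Year", "ColumnNum", "PageNumOfTable"):
        if record.get(key):
            record[key] = int(record[key])
    return record


record = parse_table_record(RECORD)
print(record["DocumentName"], record["ColumnNum"])
```

Raw extractor output may contain unescaped characters such as "<" inside cell text, so a real loader would likely need an escaping or repair step before parsing.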
ChemXSeer Table Search
ChemXSeer Figure/Plot Data Extraction
and Search
Numerical data in
scientific publications
are often found in figures.
No search engine allows
searching on figures and their
data in chemical documents.
Tools that automate data extraction from figures and allow
search on them can:
• Increase our understanding of key concepts of papers.
• Provide data for automatic comparative analyses.
• Enable regeneration of figures in different contexts.
• Enable search for documents with figures containing specific
experiment results.
X. Lu et al., JCDL 2006; Ray Choudhury et al., JCDL 2013, ICDAR 2013
Our Contribution
ChemXSeer Name and Formula
Extraction and Search
• Extraction and search of chemical names and formulae in
scientific documents has been shown to be very useful.
• Extraction and search on chemical names is hard:
– Many chemical molecules are created every day, so any dictionary-based
name recognizer will eventually fail.
– Names need to be segmented to get semantically meaningful sub-terms
such as “methyl”, “ethyl” and “alcohol” from “methylethyl alcohol”.

• Identifying formulae is hard:
– “… YSI 5301, Yellow Springs, OH, USA …” (non-formula)
– “… such as hydroxyl radical OH, superoxide O2- …” (formula)

• For searching, formulae cannot be treated as text.
– Domain knowledge (formula identification)
– Structural knowledge (substructure finding and search)

B. Sun et al., WWW 2007, WWW 2008, TOIS
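The "OH" ambiguity can be made concrete with a purely surface-level check. The sketch below is an illustration only, not the tagger described in the cited papers: `ELEMENTS` is a deliberately tiny, assumed set of element symbols, and `looks_like_formula` and `candidates` are hypothetical helpers. It accepts a term if it splits entirely into element symbols with optional counts, and it flags "OH" in both example sentences, which is why context is needed in addition to the string itself.

```python
import re

# A small, deliberately incomplete set of element symbols (assumption for the sketch).
ELEMENTS = {"H", "C", "N", "O", "S", "P", "Cl", "Na", "Sn"}

# One element token: capital letter, optional lowercase letter, optional count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def looks_like_formula(term):
    """True if the whole term splits into known element symbols with optional counts."""
    pos = 0
    while pos < len(term):
        match = TOKEN.match(term, pos)
        if not match or match.group(1) not in ELEMENTS:
            return False
        pos = match.end()
    return len(term) > 0


def candidates(text):
    """Return alphanumeric terms in text that pass the surface-level check."""
    return [t for t in re.findall(r"[A-Za-z0-9]+", text) if looks_like_formula(t)]


non_formula = "YSI 5301, Yellow Springs, OH, USA"
formula = "such as hydroxyl radical OH, superoxide O2-"

# Both sentences yield "OH" as a candidate; only context tells them apart.
print(candidates(non_formula))
print(candidates(formula))
```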
Chemical Entity Extraction and Tagging
● Name tagging
– Each chemical name can be a phrase
– Example
● "... Determination of lactic acid and ..."
● "... insecticide promecarb (3-isopropyl-5-methylphenyl
methylcarbamate) acts against ..."

● Formula tagging
– Each formula is a single term
– Example
● "... such as hydroxyl radical OH, superoxide ..."

– Non-formula example
● "... YSI 5301, Yellow Springs, OH, USA ..."

● Tagging examples
– Name tagging:
"... of <name-type>lactic acid</name-type> and ..."

– Formula tagging:
"... radical <formula-type>OH</formula-type> , superoxide ..."
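Once a tagger has decided which character spans are names or formulas, producing output in the tagged form shown above is mechanical. The following is a small sketch under that assumption; `tag_spans` is a hypothetical helper, and the spans are supplied by hand rather than by a trained tagger.

```python
def tag_spans(text, spans):
    """Wrap each (start, end, label) span of text in <label>...</label> tags.

    Spans are assumed not to overlap. They are applied from right to left so
    that inserting tags does not shift the offsets of spans still to be applied.
    """
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"<{label}>" + text[start:end] + f"</{label}>" + text[end:]
    return text


name_sentence = "... of lactic acid and ..."
n_start = name_sentence.index("lactic acid")
tagged_name = tag_spans(name_sentence, [(n_start, n_start + len("lactic acid"), "name-type")])

formula_sentence = "... radical OH , superoxide ..."
f_start = formula_sentence.index("OH")
tagged_formula = tag_spans(formula_sentence, [(f_start, f_start + 2, "formula-type")])

print(tagged_name)
print(tagged_formula)
```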
Online Chemical Entity Tagger
● We have an open-source chemical name and formula
tagger and a web-based interface for evaluation.
● The interface takes a PDF file as input and returns the text of
the PDF with names or formulas tagged.
Online Chemical Entity Tagger: Chemical
Name Tagging Example
● Results on a sample PDF.
● Some chemical formulas are erroneously identified as chemical
names (loss of precision).
● High recall (most chemical names are identified).
Online Chemical Entity Tagger: Chemical
Formula Tagging Example
● Results on a sample PDF.
● Some chemical formulas are not identified (loss of recall).
● High precision (words identified as formulas are actual formulas).
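The precision and recall trade-offs noted on the two tagger slides follow the usual definitions: precision is the fraction of predicted entities that are correct, and recall is the fraction of true entities that were found. A short sketch follows; the gold and predicted sets are invented purely for illustration and are not results from the tagger.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of tagged entities."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Invented example: a name tagger that finds every name but also tags a formula.
name_p, name_r = precision_recall(
    predicted={"lactic acid", "promecarb", "OH"},
    gold={"lactic acid", "promecarb"},
)

# Invented example: a formula tagger that makes no false calls but misses one formula.
formula_p, formula_r = precision_recall(
    predicted={"OH", "O2"},
    gold={"OH", "O2", "CCl4"},
)

print(name_p, name_r)
print(formula_p, formula_r)
```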
Chemical Name Indexing and Search
• Index Schemes:
– Which tokens to index?
– Indexing all subsequences generates a very large
index.
– “but” is a morpheme in “butane”, but not in “nembutal”.

● Segmentation-based index scheme
– Used for indexing chemical names
– First segment a chemical name hierarchically and then index
substrings at each node if frequent.
– acetaldoxime -> aldoxime -> oxime.
– A search for “oxime” returns all three names, with the order depending on the ranking function.
– This cannot be done with ordinary text search.
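The segmentation-based scheme can be sketched with a toy inverted index. In this sketch the segmentation tree is written by hand in `SEGMENTS` (an assumption; the slides describe segmenting names automatically), and `min_freq` is a hypothetical threshold standing in for the "if frequent" condition. Each name is indexed under itself and under every tree node that occurs in enough names.

```python
from collections import Counter, defaultdict

# Hypothetical hand-written segmentation tree: each string maps to its child segments.
SEGMENTS = {
    "acetaldoxime": ["acet", "aldoxime"],
    "aldoxime": ["ald", "oxime"],
}


def tree_nodes(name):
    """Return the name plus every segment below it in the segmentation tree."""
    nodes = [name]
    for child in SEGMENTS.get(name, []):
        nodes.extend(tree_nodes(child))
    return nodes


def build_index(names, min_freq=2):
    """Index each name under itself and under tree nodes found in >= min_freq names."""
    freq = Counter(node for name in names for node in set(tree_nodes(name)))
    index = defaultdict(set)
    for name in names:
        for node in tree_nodes(name):
            if node == name or freq[node] >= min_freq:
                index[node].add(name)
    return index


names = ["acetaldoxime", "aldoxime", "oxime"]
index = build_index(names)

# A query for "oxime" reaches all three names; ranking would then order them.
print(sorted(index["oxime"]))
```

Rare segments such as "acet" fall below the threshold and are left out, which keeps the index smaller than indexing every substring would.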
Example Formula Search

http://chemxseer.ist.psu.edu/ChemXSeerFormulaSearch/help.htm
Expert Recommendation - CiteSeerX
http://seerseer.ist.psu.edu (new version CSSeers)
Built on top of millions of papers in CiteSeerX.
A similar system was developed for Dow Chemical.
Can find experts in “polymer chemistry” or the expertise of “Linus Pauling”.
Finds an expert based on their publications.
Many approaches:
– Keyphrases
– Citations
– Download count
– Affiliation
Treeratpituk, Chen, JCDL’13
Future Work
Lots of interesting work to do! Few computer/machine
learning scientists involved.
• Acquisitions - more documents, data, knowledge
• Chemical 3D graph search
• Fundamental chemical graph representation analysis
• Table data storage and access
• Figure search and data extraction and access
• New data and feature search
  • spectra, experimental methods, instrumentation
• New documents: 400K PubMed
• Semantic chemical graphs
• Expert/collaborator search
• Search integration of all features
DEMO

Editor's notes

  1. The first data mining task is to detect chemical names and formulas in the literature. The task of entity tagging is then to find the hidden label of each term in the text.
  2. Most of the substrings on the tree are semantically meaningful.