Search Engine and Repository for eChemistry
C. Lee Giles, Prasenjit Mitra, Karl Mueller, Levent Bolelli, Xiaonan Lu, Saurabh
Kataria, Ying Liu, Anuj Jaiswal, Kun Bai, Bingjun Sun, Isaac Councill, James Z. Wang,
James Kubicki, Barbara Garrison, William Brouwer, Joel Bandstra, Qingzhao Tan,
Juan Pablo Ramirez Fernandez, Madian Khabsa, Hung-Hsuan Chen, Sagnik Ray
Choudhury
Chemistry, Computer Sciences and Engineering, Geosciences, Information Sciences
and Technology
Pennsylvania State University, University Park, PA, USA

Past funding: NSF Cyberinfrastructure Chemistry, Microsoft
Current Support: Dow Chemical

http://chemxseer.ist.psu.edu
Talk Overview
● Challenges and Motivation.
● Functionalities
– Fulltext Search
– Author Search
– Table Search
– Figure Search
– Expertise Search
– Chemical Name and Formula Tagging
– Chemical Name and Formula Search
● Summary.
Based on cyberinfrastructure
for CiteSeerX
Built on Solr/Lucene,
SeerSuite, other OSS
ChemXSeer RSC
ChemXSeer Fulltext Search
ChemXSeer Author Search
ChemXSeer Table Search
• Tables are widely used to present experimental results or
statistical data in scientific documents.
• Existing search engines treat tabular data as regular text, so
structural information and semantics are not preserved.
• We automatically identify tables and extract table metadata as XML.
Table Metadata Representation:
• Environment metadata: (document specifics: type, title,…)
• Frame metadata: (border left, right, top, bottom, …)
• Affiliated metadata: (Caption, footnote, …)
• Layout metadata: (number of rows, columns, headers,…)
• Cell content metadata: (values in cells)
• Type metadata: (numeric, symbolic, hybrid, …)

Y. Liu et al., AAAI 2007, JCDL 2007.
Sample Table Metadata Extracted File

<Table>
<DocumentOrigin>Analyst</DocumentOrigin>
<DocumentName>b006011i.pdf</DocumentName>
<Year>2001</Year>
<DocumentTitle>Detection of chlorinated methanes by tin oxide gas sensors </DocumentTitle>
<Author>Sang Hyun Park, a ? Young-Chan Son, a Brenda R . Shaw, a Kenneth E. Creasy,* b and Steven L. Suib* acd a Department of Chemistry, U-60, University of Connecticut, Storrs, C T 06269 3060</Author>
<TheNumOfCiters></TheNumOfCiters>
<Citers></Citers>
<TableCaption>Table 1 Temperature effect o n r esistance change ( D R ) and response timeof tin oxide thin film with 1 % C Cl 4</TableCaption>
<TableColumnHeading>D R Temperature/ ¡ã C D R a / W ( R ,O 2 ) (%) R esponse time Reproducibiliy </TableColumnHeading>
<TableContent>100 223 5 ~ 22 min Yes 200 270 9 ~ 7-8 min Yes 300 1027 21 < 2 0 s Yes 400 993 31 ~ 1 0 s No </TableContent>
<TableFootnote> a D R =( R , CCl 4 ) - ( R ,O 2 ). </TableFootnote>
<ColumnNum>5</ColumnNum>
<TableReferenceText>In page 3, line 11, … Film responses to 1% CCl4 at different temperatures are summarized in Table 1……</TableReferenceText>
<PageNumOfTable>3</PageNumOfTable>
<Snapshot>b006011i/b006011i_t1.jpg</Snapshot>
</Table>
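As a rough illustration of how such a record could be consumed, the sketch below loads one record into a dictionary and runs a keyword match against the caption only. It assumes a cleaned, well-formed record; the abridged `RECORD` string and the helper names `parse_table_record` and `caption_matches` are hypothetical and not part of ChemXSeer.

```python
import xml.etree.ElementTree as ET

# Cleaned, abridged stand-in for one extracted record (an assumption for this
# sketch; real extractor output is noisier and may not be well-formed XML).
RECORD = """<Table>
  <DocumentOrigin>Analyst</DocumentOrigin>
  <DocumentName>b006011i.pdf</DocumentName>
  <Year>2001</Year>
  <TableCaption>Table 1 Temperature effect on resistance change</TableCaption>
  <ColumnNum>5</ColumnNum>
  <PageNumOfTable>3</PageNumOfTable>
</Table>"""


def parse_table_record(xml_text):
    """Return a dict mapping each child tag of <Table> to its stripped text."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def caption_matches(record, query):
    """Case-insensitive match: every query term must occur in the caption."""
    caption = record.get("TableCaption", "").lower()
    return all(term in caption for term in query.lower().split())


record = parse_table_record(RECORD)
print(record["DocumentName"], record["ColumnNum"])
print(caption_matches(record, "temperature resistance"))
```

Keeping the caption, headings, and cell content in separate fields is what lets a query target one part of a table instead of the whole page text.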
ChemXSeer Table Search
ChemXSeer Figure/Plot Data Extraction
and Search
Numerical data in scientific publications are often found in figures.
No search engine allows searching on figures and their data in chemical documents.
Tools that automate data extraction from figures and allow search on them can:
• Increase our understanding of key concepts of papers.
• Provide data for automatic comparative analyses.
• Enable regeneration of figures in different contexts.
• Enable search for documents with figures containing specific
experimental results.

X. Lu et al., JCDL 2006; Ray Choudhury et al., JCDL 2013, ICDAR 2013.
Our Contribution
ChemXSeer Name and Formula
Extraction and Search
• Extraction and search of chemical names and formulae in
scientific documents has been shown to be very useful.
• Extraction and search on chemical names is hard:
– Many new molecules are created every day, so any dictionary-based
name recognizer will eventually fail.
– Names need to be segmented to get semantically meaningful sub-terms
such as “methyl”, “ethyl” and “alcohol” from “methylethyl alcohol”.

• Identifying formulae is hard:
– “… YSI 5301, Yellow Springs, OH, USA …” (non-formula)
– “… such as hydroxyl radical OH, superoxide O2- …” (formula)

• For searching, formulae cannot be treated as text:
– Domain knowledge (formula identification)
– Structural knowledge (substructure finding and search)

B. Sun et al., WWW 2007, WWW 2008, TOIS.
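The "OH" ambiguity can be made concrete with a small sketch. It assumes a deliberately naive, context-free checker that accepts any token made of known element symbols and counts; `ELEMENTS` is a tiny illustrative subset and `parse_formula` is a hypothetical helper, not the tagger described in the cited papers.

```python
import re

# Small, deliberately incomplete set of element symbols (assumption).
ELEMENTS = {"H", "C", "N", "O", "S", "Cl", "Na", "Sn"}

# One element symbol followed by an optional count, e.g. "Cl4".
SYMBOL = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(token):
    """Return {element: count} if the token is built only from known
    element symbols and counts; otherwise return None."""
    counts = {}
    pos = 0
    while pos < len(token):
        match = SYMBOL.match(token, pos)
        if match is None or match.group(1) not in ELEMENTS:
            return None
        element, number = match.group(1), match.group(2)
        counts[element] = counts.get(element, 0) + int(number or 1)
        pos = match.end()
    return counts or None


sentences = [
    "YSI 5301, Yellow Springs, OH, USA",
    "such as hydroxyl radical OH, superoxide",
]
for sentence in sentences:
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    print([t for t in tokens if parse_formula(t) is not None])
```

Both sentences yield the same candidate, "OH", so a checker that looks only at the token cannot separate the state abbreviation from the hydroxyl radical; the surrounding words have to be taken into account.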
Chemical Entity Extraction and Tagging
● Name tagging
– Each chemical name can be a phrase
– Example
● "... Determination of lactic acid and ..."
● "... insecticide promecarb (3-isopropyl-5-methylphenyl
methylcarbamate) acts against ..."

● Formula tagging
– Each formula is a single term
– Example
● "... such as hydroxyl radical OH, superoxide ..."

– Non-formula example
● "... YSI 5301, Yellow Springs, OH, USA ..."

● Tagging examples
– Name tagging:
"... of <name-type>lactic acid</name-type> and ..."

– Formula tagging:
"... radical <formula-type>OH</formula-type>, superoxide ..."
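A minimal sketch of the output step, assuming some upstream model has already assigned one label per token: consecutive tokens with the same label are wrapped in a single tag pair. The function name `tag_tokens` and the label encoding (`None` for untagged tokens) are assumptions made for illustration.

```python
def tag_tokens(tokens, labels):
    """Wrap each maximal run of identically labelled tokens in <label> tags.

    labels[i] is "name-type", "formula-type", or None for untagged tokens.
    """
    pieces = []
    i = 0
    while i < len(tokens):
        label = labels[i]
        j = i
        # Extend the run while the label stays the same.
        while j < len(tokens) and labels[j] == label:
            j += 1
        span = " ".join(tokens[i:j])
        pieces.append(f"<{label}>{span}</{label}>" if label else span)
        i = j
    return " ".join(pieces)


tokens = ["Determination", "of", "lactic", "acid", "and"]
labels = [None, None, "name-type", "name-type", None]
print(tag_tokens(tokens, labels))
# Determination of <name-type>lactic acid</name-type> and
```

With this encoding, two adjacent entities of the same type would be merged into one span; a begin/inside labelling scheme would be needed to keep them apart.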
Online Chemical Entity Tagger
● We have an open-source chemical name and formula
tagger and a web-based interface for evaluation.
● The interface takes a PDF file as input and returns the text of
the PDF with names or formulas tagged.
Online Chemical Entity Tagger: Chemical
Name Tagging Example
● Results on a sample PDF.
● Some chemical formulas are erroneously identified as chemical
names (loss of precision).
● High recall (most chemical names are identified).
Online Chemical Entity Tagger: Chemical
Formula Tagging Example
● Results on a sample PDF.
● Some chemical formulas are not identified (loss of recall).
● High precision (terms identified as formulas are actual formulas).
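The two trade-offs above can be stated numerically. The sketch below computes precision and recall over sets of tagged strings; the example sets are made up for illustration and are not results from the tagger.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for predicted vs. gold entity sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Name tagging: everything in gold is found, plus one formula tagged by mistake.
name_p, name_r = precision_recall(
    predicted={"lactic acid", "promecarb", "CCl4"},
    gold={"lactic acid", "promecarb"},
)

# Formula tagging: every prediction is correct, but one formula is missed.
formula_p, formula_r = precision_recall(
    predicted={"OH", "CCl4"},
    gold={"OH", "CCl4", "O2"},
)

print(round(name_p, 2), round(name_r, 2))
print(round(formula_p, 2), round(formula_r, 2))
```

In the first case recall is 1.0 and precision drops; in the second precision is 1.0 and recall drops, matching the two slides.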
Chemical Name Indexing and Search
• Index schemes:
– Which tokens to index?
– Indexing all subsequences generates a very large index.
– “but” is a morpheme in “butane” but not in “nembutal”.

● Segmentation-based index scheme
– Used for indexing chemical names
– First segment a chemical name hierarchically, then index the
substring at each node if it is frequent.
– acetaldoxime -> aldoxime -> oxime.
– A search for “oxime” returns all of these, ordered by the ranking function.
– This cannot be done with ordinary text search.
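The scheme can be sketched with a toy inverted index. The segmentations in `SEGMENTATION` are hand-written and hypothetical (the real system derives them automatically), and `min_freq` stands in for whatever frequency criterion decides that a substring is worth indexing.

```python
from collections import Counter, defaultdict

# Hypothetical segmentation nodes per name, written by hand for this sketch.
SEGMENTATION = {
    "acetaldoxime": ["acetaldoxime", "aldoxime", "oxime"],
    "benzaldoxime": ["benzaldoxime", "aldoxime", "oxime"],
    "butane": ["butane", "but", "ane"],
    "butanol": ["butanol", "but", "anol"],
    "nembutal": ["nembutal"],  # "but" is not a segment here
}


def build_index(names, min_freq=2):
    """Index each name under its full form and under every segmentation
    node that occurs in at least min_freq names."""
    freq = Counter(node for name in names for node in set(SEGMENTATION[name]))
    index = defaultdict(set)
    for name in names:
        for node in SEGMENTATION[name]:
            if node == name or freq[node] >= min_freq:
                index[node].add(name)
    return index


index = build_index(list(SEGMENTATION))
print(sorted(index.get("oxime", ())))  # ['acetaldoxime', 'benzaldoxime']
print(sorted(index.get("but", ())))    # ['butane', 'butanol']
print(sorted(index.get("ane", ())))    # [] (too rare to index)
```

Because only segmentation nodes are indexed, a query for "but" reaches butane and butanol without matching nembutal, and the index stays far smaller than one built from all substrings.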
Example Formula Search

http://chemxseer.ist.psu.edu/ChemXSeerFormulaSearch/help.htm
Expert Recommendation - CiteSeerX
http://seerseer.ist.psu.edu (new version CSSeers)
Built on top of millions of papers in CiteSeerX.
A similar system was developed for Dow Chemical.
Can find experts in “polymer chemistry” or the expertise of “Linus Pauling”.
Finds an expert based on their publications.
Many approaches:
• Keyphrases
• Citations
• Download count
• Affiliation
Treeratpituk, Chen, JCDL’13
Future Work
Lots of interesting work to do! Few computer/machine
learning scientists involved.
• Acquisitions - more documents, data, knowledge
• Chemical 3D graph search
• Fundamental chemical graph representation analysis
• Table data storage and access
• Figure search and data extraction and access
• New data and feature search
– spectra, experimental methods, instrumentation
• New documents: 400K PubMed
• Semantic chemical graphs
• Expert/collaborator search
• Search integration of all features
DEMO


Editor's notes

  1. The first data mining task is to detect chemical names and formulas in the literature. The task of entity tagging is then to find the hidden label of each term in the text.
  2. Most of the substrings on the segmentation tree are semantically meaningful.