Recent advances in the project
EXCITE – Extraction of Citations
from PDF Documents
Philipp Mayr
GESIS – Leibniz Institute for the Social Sciences
2018-09-03, Bologna
http://excite.west.uni-koblenz.de/
#opencitations
#WOOC2018
EXCITE team
• PI: Steffen Staab (WeST), Philipp Mayr (GESIS)
• Researchers: Behnam Ghavimi, Zeyd Boukhers
• Developer: Azam Hosseini
• Collaborators: Heinrich Hartmann, Martin Körner
EXCITE: Background
• We run productive search systems and research in information
retrieval, recommendation systems and knowledge discovery
− SSOAR https://www.gesis.org/ssoar/ (48K full texts)
− GESIS Search https://search.gesis.org/ (242K data sets +
further materials)
• National literatures are not well represented in major citation
indices (like WoS, Scopus)
• Shortage of citation data for the international and German
social sciences (Social Science Citation Index is not enough)
• Open availability of citation data is improving but still very
limited
EXCITE: Main objectives
• Develop web services to allow third-parties to
extract citation data from arbitrary publications
• Develop a toolchain of reference extraction and
matching software
• Integrate and publish the extracted citation data in
reusable formats
• Narrow the supply gap of citation data in the social
sciences
EXCITE: toolchain
(1) Extraction of text from source documents (PDFs),
(2) Identification of reference sections and other forms of embedded reference
information within the text,
(3) Segmentation of individual references into their constituent fields, such as author, title,
etc.,
(4) Matching of reference strings against bibliographic databases,
(5) Export and publication of matched references to reusable formats (convert to OCC)
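The five steps above can be sketched as a small pipeline. This is a toy illustration under loud assumptions, not the EXCITE implementation: step 1 (PDF text extraction) is assumed to have happened already, the reference section is found by a literal "References" heading, segmentation uses a single regular expression instead of a trained model, and matching is an exact title lookup in a hypothetical in-memory index. All function names are invented for this sketch.

```python
import re
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class Reference:
    raw: str                                     # reference string as found in the text
    fields: dict = field(default_factory=dict)   # step 3 output: author, year, title
    match_id: Optional[str] = None               # step 4 output: id of the matched record


def identify_references(text: str) -> list:
    """Step 2 (toy): every non-empty line after a 'References' heading."""
    _, _, tail = text.partition("References\n")
    return [line.strip() for line in tail.splitlines() if line.strip()]


def segment(raw: str) -> dict:
    """Step 3 (toy): parse 'Author (Year). Title.' with a regular expression."""
    m = re.match(r"(?P<author>.+?) \((?P<year>\d{4})\)\. (?P<title>.+?)\.?$", raw)
    return m.groupdict() if m else {}


def match(fields: dict, database: dict) -> Optional[str]:
    """Step 4 (toy): exact lookup of the lower-cased title."""
    return database.get(fields.get("title", "").lower())


def run_pipeline(text: str, database: dict) -> list:
    """Steps 2 to 5 on already extracted text; returns plain dicts for export."""
    refs = []
    for raw in identify_references(text):
        fields = segment(raw)
        refs.append(Reference(raw, fields, match(fields, database)))
    return [asdict(r) for r in refs]


# Hypothetical input: text of one document and a tiny title -> id index.
text = "Some body text\nReferences\nMayr, P. (2018). Recent advances in EXCITE.\n"
database = {"recent advances in excite": "doc-1"}
print(run_pipeline(text, database))
```

Keeping each step as its own function mirrors the toolchain idea: any stage can be swapped for a stronger component without touching the others.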
EXCITE: recent advances
• All components are available as reusable software, see
https://github.com/exciteproject
• EXparser – tool to extract and segment references (see the talk
by Zeyd Boukhers, “A Generic Approach for Reference
Extraction from PDF Documents”, tomorrow)
• Annotators and Gold standards – tools to annotate references
and different gold standards to train and test the tools
• EXmatcher – tool to match references against bibliographic
databases, based on Solr and Elasticsearch
• EXpublisher – tool to convert EXCITE data to JSON-LD
• Public demo http://excite.west.uni-koblenz.de/excite
• Extracted and matched data in productive systems, e.g.
https://search.gesis.org/publication/gesis-ssoar-10004
EXAnnotators: Reference Identification
http://excite.west.uni-koblenz.de/refanno
EXAnnotators: Reference Segmentation
http://excite.west.uni-koblenz.de/seganno
EXCITE: Demo
Screenshots: Uploading File, Display References, Result
http://excite.west.uni-koblenz.de/excite
EXmatcher
• Input: segmented reference strings with a probability
for each segment
• Output: IDs of the matched documents
EXmatcher
• Hybrid approach: a combination of blocking techniques and a
classifier algorithm
• Input: strings, segments, probabilities
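A minimal sketch of what a blocking-plus-classifier matcher can look like, assuming a few things the slides do not specify: the blocking key (publication year plus first letter of the author string), the similarity measure (`difflib.SequenceMatcher` from the Python standard library), and a fixed threshold standing in for a trained classifier. The record layout and all function names are hypothetical; this is not the EXmatcher code.

```python
from difflib import SequenceMatcher


def block_key(record: dict) -> str:
    """Blocking key (assumed): year plus first letter of the author string."""
    return f"{record['year']}:{record['author'][:1].lower()}"


def build_blocks(database: list) -> dict:
    """Group database records by blocking key so only a few candidates are compared."""
    blocks = {}
    for rec in database:
        blocks.setdefault(block_key(rec), []).append(rec)
    return blocks


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(ref: dict, probs: dict, cand: dict) -> float:
    """Mean field similarity, weighted by the segment probabilities."""
    fields = ("title", "author")
    total = sum(probs[f] for f in fields)
    return sum(probs[f] * similarity(ref[f], cand[f]) for f in fields) / total


def match_reference(ref: dict, probs: dict, blocks: dict, threshold: float = 0.8):
    """Return the id of the best candidate in the block, or None below the threshold.

    The threshold rule is a stand-in for a trained classifier.
    """
    candidates = blocks.get(block_key(ref), [])
    best = max(candidates, key=lambda c: score(ref, probs, c), default=None)
    if best is not None and score(ref, probs, best) >= threshold:
        return best["id"]
    return None
```

Weighting by segment probability means that a field the segmenter is unsure about contributes less to the decision, which is one plausible way to use the probabilities listed as input.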
EXpublisher
• Converting extracted and matched data to the OCC ontology (incl.
EXCITE Identifier in the OCI)
• Enrichment of the reference information by external metadata
https://github.com/exciteproject/EXpublisher
EXmatcher and EXpublisher will be included in the demo soon!
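To make the conversion step more concrete, here is a hedged sketch of one matched citation serialized as JSON-LD. It assumes the CiTO vocabulary (`cito:Citation`, `cito:hasCitingEntity`, `cito:hasCitedEntity`) and an OCI of the form `oci:<citing>-<cited>` built from numeric identifiers. The `example.org` IRIs, the numeric identifiers and the function names are placeholders, and the exact shape of the EXpublisher output may differ.

```python
import json
import re

CITO = "http://purl.org/spar/cito/"
DCTERMS = "http://purl.org/dc/terms/"


def make_oci(citing_num: str, cited_num: str) -> str:
    """Build an OCI string from two numeric identifiers (placeholders here)."""
    for part in (citing_num, cited_num):
        if not re.fullmatch(r"[0-9]+", part):
            raise ValueError(f"OCI parts must be numeric, got {part!r}")
    return f"oci:{citing_num}-{cited_num}"


def citation_jsonld(citing_iri: str, cited_iri: str,
                    citing_num: str, cited_num: str) -> dict:
    """Describe one citation as a JSON-LD document (illustrative layout only)."""
    return {
        "@context": {"cito": CITO, "dcterms": DCTERMS},
        "@id": f"https://example.org/ci/{citing_num}-{cited_num}",  # placeholder IRI
        "@type": "cito:Citation",
        "cito:hasCitingEntity": {"@id": citing_iri},
        "cito:hasCitedEntity": {"@id": cited_iri},
        "dcterms:identifier": make_oci(citing_num, cited_num),
    }


doc = citation_jsonld("https://example.org/br/1", "https://example.org/br/2",
                      "0101", "0202")
print(json.dumps(doc, indent=2))
```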
Next steps in EXCITE
• EXCITE references to be published in the OpenCitations Corpus (OCC)
• Public EXCITE API for testing (to be released soon)
• Reference Matching to Crossref to be added in the
demo/API
• Gold Standards (German/English/Reference
Section/Footnotes) to be completed
• Extraction models for German and English texts
• More Social Science data to be processed and released
• We are trying to process arXiv for the OCC
Thank you
Contact:
Dr Philipp Mayr
GESIS - Leibniz Institute for the Social Sciences, Germany
Email: philipp.mayr@gesis.org
Twitter: @philipp_mayr
• Project website
http://excite.west.uni-koblenz.de/
• EXCITE mailing list: Subscribe to our Newsletter.
• Demo http://excite.west.uni-koblenz.de/excite
• GitHub https://github.com/exciteproject/