One year ago we started ingesting citation data from the Open Access literature into the OpenCitations Corpus (OCC), creating an RDF dataset of scholarly citation data that is open to all. In this presentation we introduce the OCC and we discuss its outcomes and uses after the first year of life.
2. What is a citation?
n The performative act of citing a published work that is relevant to the current
work, typically made by including a reference in a reference list
Why are citations important?
n The act of bibliographic citation is central to scholarly communication –
bibliographic references are the links that knit together independent scholarship
n Citations unify the whole world of scholarship into a giant citation network
n Citation networks reveal the development of academic disciplines
n Sir Isaac Newton:
“If I have seen a little further, it is by standing on the shoulders of Giants”
3. How is the present situation imperfect?
n The present scholarly citation system inadequately exposes the knowledge
networks that exist within the scholarly literature
n Citation data are hidden behind subscription firewalls of commercial companies
n Academics are not free to use their own citation data as they please
n In this Open Access age, it is a scandal that reference lists from journal articles,
the core elements of the academic data cycle, are not freely available for use by
the scholars who created them
n Citation data now need to be recognized as a part of the Commons – those
works that are freely and legally available for sharing
n To address this issue, we have developed The OpenCitations Corpus
4. How this came about - 2009 adventures in semantic publishing
5. The SPAR (Semantic Publishing and Referencing) Ontologies
FaBiO, the FRBR-aligned Bibliographic Ontology - an ontology for
describing bibliographic entities (books, articles, etc.)
CiTO, the Citation Typing Ontology - an ontology that enables the
characterization of citations, both factually and rhetorically
BiRO, the Bibliographic Reference Ontology - an ontology to define
bibliographic records and references, and their compilation into bibliographic
collections and reference lists, respectively
C4O, the Citation Counting and Context Characterization Ontology
DoCO, the Document Components Ontology
PRO, the Publishing Roles Ontology
PSO, the Publishing Status Ontology
PWO, the Publishing Workflow Ontology . . . and now others
http://www.sparontologies.net/
6. The OpenCitations Corpus
n The OpenCitations Corpus is a Linked Open Data repository of scholarly
bibliographic citation data described using the SPAR ontologies
n Prototype created at Oxford in 2011 by Alex Dutton with JISC funding
n A new instantiation created by Silvio Peroni at the University of Bologna in late 2015
§ based on a revised metadata schema, with automated daily ingestion of
citations from authoritative sources
n OCC now provides the largest RDF collection of open citation data on the Web
§ currently holds the references from ~150,000 citing bibliographic resources
§ providing ~6.7 million citation links to over 4 million cited resources
n These citations are encoded using the SPAR ontologies, and are freely available
under a CC0 public domain waiver from http://opencitations.net/
n The OpenCitations Enhancement Project has just been funded
by the Sloan Foundation, to enhance ingest rates and provide
smart data visualization interfaces
7. Ingestion workflow
n We developed several scripts for implementing the ingestion workflow that
populates the OpenCitations Corpus
n All the software is available on the OpenCitations GitHub repository
https://github.com/essepuntato/opencitations
§ Released as open source code with the ISC License
https://opensource.org/licenses/ISC
n These scripts implement a live and iterative process
n Why live?
§ It is working while I’m speaking
§ It never sleeps
§ It is like a sentient, relentless, fast zombie – watch out!
n Why iterative?
§ The ingestion workflow continuously calls several external APIs to obtain new
reference lists and clean metadata of the citing and cited papers
8. Reference lists from PubMed Central
n At present, all the reference lists are taken by processing the XML sources of
the papers in the PubMed Central Open Access subset
n We use the Europe PubMed Central API for retrieving the XML sources
§ We ask for the most recent papers first
§ Thus, as citing papers, the OCC mainly includes articles published in
2016 and 2017
n There are 1.58M OA articles available in Europe PubMed Central, according to its API
http://www.ebi.ac.uk/europepmc/webservices/rest/search?query=open_access:y
§ We have harvested 10% so far . . .
n The identifiers of all the citing papers that have already been processed by
the ingestion workflow are stored locally, so as not to request the same XML
source twice
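The bookkeeping described in the last bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the OpenCitations code: the store file `processed_ids.txt` and the helper names `load_processed`, `select_new` and `mark_processed` are hypothetical.

```python
from pathlib import Path


def load_processed(store: Path) -> set[str]:
    """Return the identifiers already handled; empty if the store does not exist yet."""
    if not store.exists():
        return set()
    return {line.strip() for line in store.read_text().splitlines() if line.strip()}


def select_new(candidates: list[str], processed: set[str]) -> list[str]:
    """Keep only identifiers not seen before, preserving order and dropping repeats."""
    seen = set(processed)
    fresh = []
    for pmcid in candidates:
        if pmcid not in seen:
            seen.add(pmcid)
            fresh.append(pmcid)
    return fresh


def mark_processed(store: Path, pmcids: list[str]) -> None:
    """Append newly handled identifiers to the local store, one per line."""
    with store.open("a") as handle:
        for pmcid in pmcids:
            handle.write(pmcid + "\n")
```

In such a scheme, only the identifiers returned by `select_new` would be requested from the API, and they would be passed to `mark_processed` once their XML source has been handled.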
9. Metadata from Crossref and ORCID
n The reference lists extracted from citing papers are made available in JSON:
{
  "doi": "10.1007/s11892-016-0752-4",
  "pmid": "27168063",
  "pmcid": "PMC4863913",
  "localid": "MED-27168063",
  "curator": "BEE EuropeanPubMedCentralProcessor",
  "source": "http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4863913/fullTextXML",
  "source_provider": "Europe PubMed Central",
  "references": [
    ...
    {
      "bibentry": "Chang, KY, Unanue, ER. Prediction of HLA-DQ8beta cell peptidome using a computational program and its relationship to autoreactive T cells, Int Immunol, 2009, 21, 6, 705, 13, DOI: 10.1093/intimm/dxp039, PMID: 19461125",
      "pmid": "19461125",
      "doi": "10.1093/intimm/dxp039",
      "pmcid": "PMC2686615",
      "process_entry": "True"
    },
    ...
  ]
}
n We then call the Crossref APIs to obtain additional information (title, authors,
venues, etc.) about the citing paper and about those papers described in the
reference list, and then call the ORCID APIs to obtain ORCIDs of the authors
Top-level fields of the JSON: the citing paper's metadata and identifiers
Each item in "references": a reference in the citing paper's reference list, with its own ids
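As a rough sketch of the step from the JSON record to the Crossref lookups, the snippet below collects the DOI of the citing paper and of each reference, and builds one Crossref REST API URL per DOI (`https://api.crossref.org/works/{doi}`). The function name `crossref_urls` and the trimmed sample record are illustrative assumptions; the actual ingestion scripts may proceed differently, and the ORCID lookup is not shown.

```python
import json
from urllib.parse import quote

# Public Crossref REST API route for a single work, addressed by DOI.
CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_urls(record_json: str) -> list[str]:
    """Build one Crossref lookup URL for the citing paper and for each reference with a DOI."""
    record = json.loads(record_json)
    dois = [record.get("doi")]
    dois += [ref.get("doi") for ref in record.get("references", [])]
    # Entries without a DOI are skipped; they cannot be looked up this way.
    return [CROSSREF_WORKS + quote(doi) for doi in dois if doi]


# Trimmed version of the record shown above (hypothetical sample, elisions removed).
SAMPLE = """
{
  "doi": "10.1007/s11892-016-0752-4",
  "references": [
    {"doi": "10.1093/intimm/dxp039", "pmid": "19461125"},
    {"pmid": "00000000"}
  ]
}
"""

for url in crossref_urls(SAMPLE):
    print(url)
```

The sketch only builds the URLs; fetching them and merging the returned titles, authors and venues into the corpus is left out.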
10. The OpenCitations Corpus data model
n Available at https://doi.org/10.6084/m9.figshare.3443876
n Implemented in the OpenCitations Ontology (OCO, https://w3id.org/oc/ontology)
§ It is not yet another bibliographic ontology, but rather simply a mechanism
for grouping together existing complementary ontological entities from
several other ontologies (e.g. SPAR and FOAF)
11. Resources included within the Corpus (as of 26 April 2017)
Entity type | What it describes | Count in the OCC
Bibliographic resource (br) | Conference papers, book chapters, journal articles, academic proceedings, books, journals, etc. | 5.1 million
Resource embodiment (re) | Digital vs. print, first and ending pages, etc. | 2.9 million
Bibliographic entry (be) | Textual content of a reference in a reference list | 6 million
Responsible agent (ra) | Given name, family name and ORCID of the agent involved | 15.8 million
Agent role (ar) | Author, publisher, etc. | 20 million
Identifier (id) | DOI, PubMed ID, PubMed Central ID, ORCID, ISSN, etc. | 10.4 million
12. OpenCitations in the wild
n Twitter: https://twitter.com/opencitations
n Blog: https://opencitations.wordpress.com
n The data in the OpenCitations Corpus are available in three different ways:
§ Direct access to bibliographic resources by means of their HTTP URIs
(via content negotiation, e.g. https://w3id.org/oc/corpus/br/1)
§ SPARQL endpoint: https://w3id.org/oc/sparql
§ Monthly dumps: http://opencitations.net/download (stored in Figshare)
Figshare statistics as of 8 May 2017
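The first two access routes listed above can be sketched with the Python standard library. This is an illustration under assumptions, not official client code: the helper names are hypothetical, the `text/turtle` media type and the `cito:cites` query are examples chosen here, and whether the endpoint accepts a GET request with a `query` parameter through the w3id.org redirect has not been checked. The sketch only builds the requests; nothing is sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

RESOURCE = "https://w3id.org/oc/corpus/br/1"  # example resource URI from the slide
ENDPOINT = "https://w3id.org/oc/sparql"       # SPARQL endpoint from the slide

# Example query: resources cited by the bibliographic resource above,
# using the CiTO property cito:cites.
QUERY = """
PREFIX cito: <http://purl.org/spar/cito/>
SELECT ?cited WHERE { <https://w3id.org/oc/corpus/br/1> cito:cites ?cited } LIMIT 10
"""


def resource_request(uri: str, mime: str = "text/turtle") -> Request:
    """Content negotiation: ask for an RDF serialization through the Accept header."""
    return Request(uri, headers={"Accept": mime})


def sparql_request(query: str) -> Request:
    """SPARQL protocol GET: the query travels URL-encoded in the 'query' parameter."""
    url = ENDPOINT + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})
```

Passing either object to `urllib.request.urlopen` would perform the actual HTTP call.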
13. Third-party usage of OpenCitations
n Projects that use OpenCitations resources:
§ Wikidata
§ OpenAIRE
§ LOC-DB
§ Others? Please let us know!
n Accesses to the OpenCitations website and services:
The pages relating to the available data (“corpus”) and the service for querying them (“sparql”)
together account for 88% of all accesses, showing that the main reason people visit
the OpenCitations website is to explore and use the data in the OpenCitations Corpus
14. What happened in the past month
n Use of the OpenCitations social accounts (Twitter and the WordPress blog)
increased markedly during the past month
n What happened?
15. Initiative for Open Citations (I4OC)
n The Initiative for Open Citations (I4OC, https://i4oc.org) is a collaboration
between scholarly publishers, researchers, and other interested parties to
promote the unrestricted availability of scholarly citation data
n Founders:
n Aim: promote the availability of structured, separable, and open citation data
n How: asking publishers
§ to submit article metadata (including reference lists) to the Crossref
Cited-by service
§ to allow Crossref to open the reference lists to the public
n Achievement: as of March 2017, the proportion of publications in Crossref with
open, freely available references has grown from 1% to more than 40%
OpenCitations is one of the founders
16. The OpenCitations ingestion rate: an update
At present: about 500,000 new citation links added per month
With the new infrastructure: about 500,000 new citation links added per day
New infrastructure coming soon
(thanks to the OpenCitations Enhancement Project
just funded by the Sloan Foundation)
The OpenCitations Corpus will have ~190 million citation links
after one year of processing with the new infrastructure
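One way to arrive at a figure of this size, assuming the rate of about 500,000 new links applies per day under the new infrastructure and that the ~6.7 million links already in the Corpus are the starting point (both readings are assumptions made for this sketch, not a derivation given in the slides):

```python
CURRENT_LINKS = 6_700_000  # citation links already in the Corpus (from the slides)
LINKS_PER_DAY = 500_000    # assumed daily ingestion rate with the new infrastructure
DAYS = 365                 # one year of processing


def projected_links(current: int, per_day: int, days: int) -> int:
    """Links expected after `days` of ingestion at a constant daily rate."""
    return current + per_day * days


total = projected_links(CURRENT_LINKS, LINKS_PER_DAY, DAYS)
print(f"{total / 1e6:.1f} million")  # prints "189.2 million"
```

Under these assumptions the projection comes to roughly 189 million, in line with the ~190 million quoted.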
17. Thank you for your attention
David Shotton: david.shotton@opencitations.net
Silvio Peroni: silvio.peroni@opencitations.net
Contacts
Website: http://opencitations.net
Email: contact@opencitations.net
Twitter: @opencitations
Blog: https://opencitations.wordpress.com
GitHub: https://github.com/essepuntato/opencitations