Classifying research papers according to their research topics is an important task to improve their retrievability, assist the creation of smart analytics, and support a variety of approaches for analysing and making sense of the research environment. In this paper, we present the CSO Classifier, a new unsupervised approach for automatically classifying research papers according to the Computer Science Ontology (CSO), a comprehensive ontology of research areas in the field of Computer Science. The CSO Classifier takes as input the metadata associated with a research paper (title, abstract, keywords) and returns a selection of research concepts drawn from the ontology. The approach was evaluated on a gold standard of manually annotated articles, yielding a significant improvement over alternative methods.
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly Articles
1. The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly Articles
Angelo A. Salatino, Francesco Osborne, Thiviyan Thanapalasingam, Enrico Motta
@angelosalatino
Knowledge Media Institute
The Open University
United Kingdom
3. Classifying Research Papers with their Topics
Annotating research papers allows us to:
• categorise proceedings in digital libraries
• semantically enhance the metadata of scientific publications
• generate recommendations
• produce smart analytics
• detect research trends
• …
4. Classifying Research Papers with their Topics
1) Topic detection methods
• Clustering approaches based on citations, title, keywords
• Topic models
• Latent Dirichlet Allocation
• Author-topic models
• Supervised classifiers
2) Vocabulary-driven
• Computing Classification System (CCS)
• JEL Classification System
• Australian and New Zealand Standard Research Classification (ANZSRC)
6. The Computer Science Ontology
• Ontology of research areas*, automatically generated using the Klink-2** algorithm on a dataset of 16 million publications, mainly in Computer Science
• The current version of CSO includes 14K topics and 143K relationships
• Main roots include Computer Science, Linguistics, Mathematics, Geometry, Semantics and so on
• Download CSO from https://cso.kmi.open.ac.uk
* Angelo A Salatino, Thiviyan Thanapalasingam, Andrea Mannocci, Francesco Osborne, Enrico Motta. "The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas." In ISWC 2018, Monterey, CA (USA).
** Francesco Osborne, and Enrico Motta. "Klink-2: integrating multiple web sources to generate semantic topic networks." In ISWC 2015, Bethlehem, PA (USA).
8. Syntactic Module
• We split the text into unigrams, bigrams and trigrams
• For each n-gram we measure the Levenshtein similarity with the topics in CSO
• We select the CSO topics whose similarity with an n-gram is greater than or equal to 0.94
• This helps handle plurals and hyphenated topics, such as:
• “knowledge based systems” and “knowledge-based systems”
• “database” and “databases”
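The matching step above can be sketched in a few lines of Python. The slide does not say how the Levenshtein distance is normalised into a similarity, so this sketch assumes a ratio over the combined length of the two strings, with substitutions costing 2; under that assumption both example pairs clear the 0.94 threshold. The function names are illustrative, not the classifier's actual API.

```python
def edit_distance(a: str, b: str, substitution_cost: int = 2) -> int:
    """Dynamic-programming edit distance (insertions and deletions cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else substitution_cost
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for disjoint ones."""
    total = len(a) + len(b)
    return 1.0 if total == 0 else (total - edit_distance(a, b)) / total


def ngrams(text: str, max_n: int = 3) -> list:
    """All unigrams, bigrams and trigrams of a whitespace-tokenised text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def syntactic_topics(text: str, topics, threshold: float = 0.94) -> set:
    """Topics whose similarity with at least one n-gram reaches the threshold."""
    return {topic for gram in ngrams(text) for topic in topics
            if similarity(gram, topic) >= threshold}
```

Comparing every n-gram with every topic is quadratic; a production system would presumably index the 14K topics first, but that detail is outside what the slide describes.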
10. Semantic Module
Word Embedding model
• We used titles and abstracts from 4.5M papers in Computer Science
• Pre-processed text:
• Topic replacement – “digital libraries” → “digital_libraries”
• Collocation analysis – “highest_accuracies”, “highly_cited_journals”
• Trained word2vec model
Parameter          Value
method             skipgram
embedding size     128
window size        10
negative samples   5
max iterations     5
min-count cutoff   10
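The topic replacement step listed in the pre-processing above can be illustrated as follows. This is a minimal sketch, assuming lower-cased text and plain string matching on word boundaries; `replace_topics` is a hypothetical helper, and the slide does not state how overlapping topics are resolved (here the longest label wins).

```python
import re


def replace_topics(text: str, topics) -> str:
    """Join multi-word topic labels with underscores so that each topic
    becomes a single token for the embedding model."""
    # Longest labels first, so "semantic web services" beats "semantic web".
    multiword = sorted((t.lower() for t in topics if " " in t),
                       key=len, reverse=True)
    lowered = text.lower()
    if not multiword:
        return lowered
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in multiword) + r")\b")
    return pattern.sub(lambda m: m.group(1).replace(" ", "_"), lowered)
```

After this step, a tokeniser that splits on whitespace sees "digital_libraries" as one word, which is what lets the model learn a vector for the whole topic.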
11. Semantic Module
Entity Extraction
• POS tagger and grammar-based chunk parser with the pattern <JJ.*>*<NN.*>+ (e.g., “digital libraries”)
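The chunk grammar <JJ.*>*<NN.*>+ reads as "zero or more adjectives followed by one or more nouns". A minimal sketch of that pattern over already-tagged tokens is shown below; it assumes Penn Treebank tags and takes the tagger's output as input rather than calling a real POS tagger. `noun_chunks` is a hypothetical name.

```python
import re


def noun_chunks(tagged_tokens) -> list:
    """Extract chunks matching <JJ.*>*<NN.*>+ from (word, tag) pairs."""
    # Encode each tag as one character so the grammar becomes a plain regex.
    encoded = "".join(
        "J" if tag.startswith("JJ") else "N" if tag.startswith("NN") else "O"
        for _, tag in tagged_tokens
    )
    return [
        " ".join(word for word, _ in tagged_tokens[m.start():m.end()])
        for m in re.finditer(r"J*N+", encoded)
    ]
```

For the tagged sentence "We/PRP analyse/VBP digital/JJ libraries/NNS", the only chunk is "digital libraries".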
CSO concept identification
• Selects all CSO topics found among the 10 most similar words of each resulting n-gram (with cosine similarity > 0.7)
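The concept identification step can be sketched with a toy embedding table. The dictionary of vectors, the topic set and the function names below are hypothetical stand-ins for the trained word2vec model and CSO; only the top-10 cut and the 0.7 threshold come from the slide.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_topics(ngram, embeddings, cso_topics, top_n=10, min_sim=0.7):
    """CSO topics among the top_n nearest words of ngram, above min_sim."""
    if ngram not in embeddings:
        return []
    query = embeddings[ngram]
    # Rank every other word in the vocabulary by similarity to the n-gram.
    ranked = sorted(
        ((cosine(query, vec), word)
         for word, vec in embeddings.items() if word != ngram),
        reverse=True,
    )
    return [(word, sim) for sim, word in ranked[:top_n]
            if sim > min_sim and word in cso_topics]
```

Note that the top-10 cut is applied before the CSO filter, so a neighbourhood crowded with non-topic words yields fewer candidate topics.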
12. Semantic Module
Concept ranking
• We assign a score to each identified topic:
• Frequency – number of times it was inferred
• Diversity – number of unique n-grams from which it was inferred
Concept Selection
• Elbow method
CSO Topic                  Score
domain ontologies          40
semantic web               40
ontology learning          40
data mining                40
heterogeneous resources    24
semantics                  24
world wide web             10
network architecture       6
scholarly communication    6
ontology matching          6
…                          …
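The ranking and selection steps can be sketched as below. Two assumptions are made that the slide does not confirm: the score is taken to be the product of frequency and diversity, and the elbow is taken to be the ranked point farthest from the straight line joining the first and last scores (one common formulation of the elbow method). Function names are illustrative.

```python
from collections import Counter, defaultdict


def score_topics(inferences) -> dict:
    """inferences: (ngram, topic) pairs, one per time a topic was inferred.
    Assumed score: frequency * diversity."""
    frequency = Counter(topic for _, topic in inferences)
    sources = defaultdict(set)
    for ngram, topic in inferences:
        sources[topic].add(ngram)
    return {topic: frequency[topic] * len(sources[topic]) for topic in frequency}


def elbow_cut(scores) -> int:
    """Number of top-ranked scores to keep, up to and including the elbow."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 3:
        return len(ranked)
    last = len(ranked) - 1
    rise = ranked[-1] - ranked[0]

    def distance(i):
        # Unnormalised distance from (i, ranked[i]) to the line through
        # (0, ranked[0]) and (last, ranked[-1]); the shared denominator
        # is dropped because only the argmax matters.
        return abs(rise * i - last * ranked[i] + last * ranked[0])

    return max(range(len(ranked)), key=distance) + 1
```

Applied to the ten scores listed above (40, 40, 40, 40, 24, 24, 10, 6, 6, 6), this sketch keeps the four topics scoring 40.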
14. Post Processing
Combination of output
Semantic enhancement
• We use the superTopicOf relationship to enhance the output set
• E.g., if “machine learning” is returned, then “artificial intelligence” is added as well
• Provides wider context for the analysed paper
• Enables analytics on high-level abstract topics (e.g., digital libraries)
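The semantic enhancement step amounts to a lookup over superTopicOf. In this minimal sketch the ontology is a hypothetical dict mapping each topic to its direct super-topics, and only one level is added; whether the classifier climbs further up the hierarchy is not stated on the slide.

```python
def enhance(topics, super_topics_of) -> set:
    """Add the direct super-topics of every topic in the output set."""
    enhanced = set(topics)
    for topic in topics:
        enhanced.update(super_topics_of.get(topic, ()))
    return enhanced
```

Stopping at direct parents keeps the output focused; climbing all the way to the root would add very generic areas to every paper.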
15. Evaluation
• We evaluated the CSO Classifier against other state-of-the-art algorithms
• TF-IDF
• LDA (with an increasing number of topics)
• previous versions of the CSO Classifier
• Using a gold standard of 70 papers
Field                          # papers
Semantic Web                   23
Natural Language Processing    23
Data Mining                    24
Total                          70
16. Gold Standard
• We asked 21 domain experts to annotate 10 papers each, so that every paper was annotated by three experts
• Candidate topics for each paper were generated by 3 classifiers:
• Syntactic module
• Semantic module
• Window-based word2vec classifier
• Experts were asked to assess whether each candidate topic was relevant or not relevant to the annotated paper
• For each paper, experts selected on average 18 topics out of 42 candidate topics (average Fleiss’ kappa: 0.45)
• The gold standard (GS) was built using a majority-rule approach
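The majority rule can be sketched as a simple vote count. This assumes a topic enters the gold standard when more than half of a paper's annotators marked it relevant (two out of three here); the function name is illustrative.

```python
from collections import Counter


def majority_gold_standard(annotations) -> set:
    """annotations: one set of relevant topics per expert, for one paper."""
    votes = Counter(topic for chosen in annotations for topic in chosen)
    return {topic for topic, count in votes.items()
            if count > len(annotations) / 2}
```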
17. Evaluation
Classifier   Description                            Prec.    Rec.     F1
TF-IDF       TF-IDF.                                16.7%    24.0%    19.7%
TF-IDF-M     TF-IDF mapped to CSO concepts.         40.4%    24.1%    30.1%
LDA100       LDA with 100 topics.                   5.9%     11.9%    7.9%
LDA500       LDA with 500 topics.                   4.2%     12.5%    6.3%
LDA1000      LDA with 1000 topics.                  3.8%     5.0%     4.3%
LDA100-M     LDA with 100 topics mapped to CSO.     9.4%     19.3%    12.6%
LDA500-M     LDA with 500 topics mapped to CSO.     9.6%     21.2%    13.2%
LDA1000-M    LDA with 1000 topics mapped to CSO.    12.0%    11.5%    11.7%
W2V-W        W2V on windows of words.               41.2%    16.7%    23.8%
STM          Syntactic module, msm=1.               80.8%    58.2%    67.6%
SYN          Syntactic module, msm=0.94.            78.3%    63.8%    70.3%
SEM          Semantic module.                       70.8%    72.2%    71.5%
INT          Intersection of SYN and SEM.           79.3%    59.1%    67.7%
CSO-C        The CSO Classifier.                    73.0%    75.3%    74.1%
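For readers who want to reproduce the three columns, the metrics for a single paper can be computed from the predicted and gold-standard topic sets as sketched below. How the per-paper values were aggregated over the 70 papers is not stated on the slide, so this sketch stops at one paper; the function names are illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(predicted: set, gold: set):
    """Precision, recall and F1 of a predicted topic set against the gold set."""
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall, f1(precision, recall)
```

As a consistency check on the table, `f1(0.730, 0.753)` gives roughly 0.741, in line with the CSO-C row.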
18. CSO Classifier adoption so far …
Since its introduction, many industrial and academic partners have started processing their data using the CSO Classifier:
Industry
• Springer Nature
• Dimension.ai
Universities
• CSET - George Washington University (USA)
• FIZ Karlsruhe (DE)
• Paris 13 (FR)
• University of Trento (IT)
• University of Campinas (BR)
19. Smart Topic Miner
The Smart Topic Miner* (STM) is a semantic
application that supports the Springer Nature
editorial team in classifying scholarly
publications in the field of Computer Science.
Try me: http://stm-demo.kmi.open.ac.uk
*Angelo Salatino, Francesco Osborne, Aliaksandr Birukou, and
Enrico Motta. "Improving Editorial Workflow and Metadata
Quality at Springer Nature." In ISWC 2019. Auckland, New
Zealand.
20. Smart Topic Miner
Since adopting it, Springer Nature has experienced three main benefits:
• halved the time for classifying a proceedings book – 30 min → 10-15 min
• reduced cost by 75%
• better classification increased the discoverability of publications (+9M downloads in 3 years)
[Figure: Average number of yearly downloads for books in SpringerLink, 2009-2018 (y-axis: 0 to 20,000). Series: downloads (CS Proceedings); expected downloads (CS Proceedings); downloads (CS Proceedings) with STM; downloads (other books in CS); downloads (overall).]
23. Future Work
• Working on a better performing classifier
• Using up-to-date NLP technologies: ELMo, BERT
• Large scale evaluation (high number of papers and different fields)
• Method for classifying papers when there is limited data (e.g. using citations)
• Collaboration with the FIZ Karlsruhe (Leibniz)
• Creating graph embeddings to support the current word2vec model
• Collaboration with University of Trento
• Using CSO Classifier on biomedical data
24. Thank you
Angelo Salatino
angelo.salatino@open.ac.uk
@angelosalatino
https://salatino.org
… and get in touch
References
• Angelo Salatino, Francesco Osborne, Aliaksandr Birukou, and Enrico Motta. "Improving Editorial Workflow and Metadata Quality at Springer Nature." In ISWC 2019. Auckland, New Zealand.
• Angelo A Salatino, Thiviyan Thanapalasingam, Andrea Mannocci, Francesco Osborne, Enrico Motta. "The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas." In ISWC 2018, Monterey, CA (USA).
• Francesco Osborne, and Enrico Motta. "Klink-2: integrating multiple web sources to generate semantic topic networks." In ISWC 2015, Bethlehem, PA (USA).