SlideShare una empresa de Scribd logo
1 de 16
Methods and Tools for Scholarly Data
Analytics
Working with Scholarly APIs: A NISO Training Series
Dr Martin Szomszor
@martinszomszor
June 9th 2022
About me
• 2003-09 PhD & Research Fellow - University of Southampton
• Automatic Data Integration, Semantic Web, Knowledge Engineering, Natural Language Processing
• 2009-11 Deputy Head of Centre - City eHealth Research Centre, London
• Epidemic intelligence using social media
• ECDC Field Epidemiology Manual Wiki
• 20011-18 Chief Data Scientist – Digital Science
• Global Research Identifier Database (GRID) -> ROR
• HEFCE Impact Case Study Database (REF2014)
• Springer Nature SciGraph
• 2018-21 Director - Institute for Scientific Information, Clarivate
• R&D for new Web of Science features including Author Impact Beam Plots, Impact Profiles, Citation Topics and the new Journal Citation Indicator
• Produced the annual Highly Cited Researcher list
• Various reports and academic papers on research management, bibliometrics, research integrity, emerging research fronts, etc…
• 2021-22 Electric Data Solutions LTD
• Custom Research Analytics for institutions, funders, publishers and charities
• Impact Case Study Database for Hong Kong RAE2020
• Google Scholar link
Typical Data Science workflow
Formulate Questions Discover data sources
• Publication data
• Grant award data
• Patent data
• Policy mentions
• Impact Case Studies
Link & enhance the data
• Text analysis
• Bibliometrics
• Geoparsing
Analytics
• Visualizations
• Metrics / Indicators
• Research Topic
• Funders
• Collaborations
• Disciplines
• Journals
• Open Access
Data Linking
• Persistent identifiers provide the best way to link data across different sources:
• DOI / ISBN for scholarly works
• GRID / ROR for institutions
• Crossref Funder Registry for funders
• ORCID for researchers
• Geonames for geography
• Many provide services to help disambiguate unstructured data:
• Crossref Simple Text Query will find DOIs for reference text
• ROR API to match author affiliations to relevant institutions
Data Enhancement
• For many analytical scenarios, it is beneficial to integrate data from multiple taxonomies or classifications that provide complimentary views
• Research may be classified using the following approaches, each of which will provide a different interpretation
• Journal Classifications (Web of Science Subject Categories, Scopus All Science Journal Classification)
• Article-level classification (Dimensions Fields of Research, InCites Citation Topics)
• Text mining (Topic Modelling, sentiment analysis)
• … and many others (OECD Frascati Manual, Funder specific taxonomies)
• Choice will depend on the research question
• Time-scale (e.g. is long term stability required or are more dynamic and contemporary topics required?)
• Granularity (is the classification suitably broad or narrow?)
• Coverage (is the data covered sufficiently well?)
• Combinations may provide insights into different aspects of the research:
• Journal Classifications align well with concept disciplinary origin
• Custom topic models can provide more nuanced views of the research type, domain of application or beneficiary group
• Bibliometric approaches are useful in the identification of research communities
• Other important aspects may include
• Collaboration mode (domestic versus international)
• Collaborating partners (industrial, healthcare etc)
• Open Access type (Gold, Hybrid, Green, etc)
Tutorial
• To demonstrate these concepts, I have created some Python notebooks to show how data can be
extracted, enhanced and analyzed
• For this purpose, I have chosen the research topic “Dengue Fever”
• Often categorized as a neglected tropical disease, it has growing research interest as cases continue to rise and
more countries see infections
• Relevant to UN Sustainable Development Goals
• Includes research on many topics including climate change, socio-economic development, migration, pharmaceutical
development, etc.
• It is a research topic where geography is important
• It is only a toy example – many crucial steps (i.e. proper search methodology and data curation) have
been omitted so the results are only for illustrative purposes
Geographical distribution of dengue cases reported worldwide, 2020
https://www.ecdc.europa.eu/en/publications-data/geographical-distribution-dengue-cases-reported-worldwide-2020
Technical Requirements
• Code for this presentation can be found on Github:
• https://github.com/martinszomszor/2022_NISO
• I have also included the data files so you can play around with analysis part without reprocessing the source data
• If you have Python and pip installed, you can easily install the required libraries using:
• pip install –r requirements.txt
• You will also need to download and extract the Edinburgh Geoparser from
• https://www.ltg.ed.ac.uk/software/geoparser/
• I use Gephi for network visualization, software available for Mac / Windows / Linux
• https://gephi.org/
Python Jupyter Notebooks
• These Notebooks should be executed in order to run the various step of the analysis:
• 01 Download Publication Data.ipynb
• Use OpenAlex API to extract publications related to “Dengue Fever”
• 02 Geoparse.ipynb
• Use the OpenSource Edinburgh Geoparser to identify locations mentioned in the title or abstract
• 03 Bibliographic Coupling.ipynb
• Extract cited references from publication data and use bibliographic coupling to cluster related publications
• Vizualise the Network using Gephi
• 04 Topic Modelling.ipynb
• Create a simple topic model using title and abstracts
• Visualize the publication dataset using UMAP
• 05 Analysis.ipynb
• Join the data files and produce some simple analysis
01 Download Publication Data.ipynb
• You may have a subscription to a bibliometric database, such as Web of Science, Scopus, or
Dimensions, but if not, there is also the option to use OpenAlex
02 Geoparse.ipynb
• Geoparsing – the identification and disambiguation of
locations mentioned in text
• Used to annotate REF2014 impact case studies with
impact locations
• When applied to research publications on diseases, it
can help identify where studies have taken place
03 Bibliographic Coupling.ipynb
• The citation network contains information
about the nature of research
publications
• One common technique is to calculate
publication similarity based on the
overlap in cited works (bibliographic
coupling)
• This image is taken from an ISI report on
UN SDGs
04 Topic Modelling.ipynb
• Topic modelling is a natural language processing
(NLP) technique that determines how certain
clusters of related words (topics) can be used to
categorise the underlying data.
• It is a data driven approach, meaning results are
not dependent on pre-conceived notions of
structure, but are instead derived from the data
itself
• This image from a 2017 report shows Arts &
Humanities research grants and their various
topics
05 Analysis.ipynb
• How frequently are locations mentioned in research relating to Dengue Fever?
• Is research focused in areas with increased incidence?
• Do certain locations have more or less research on a particular topic?
• Are publications that mention certain countries authored by researches based on those
countries?
Final remarks
• There are many public data sources that can be linked with scholarly data to provide new
analytical opportunities
• World Bank
• World Health Organization
• European Union Open Data Portal
• UNICEF Datasets
• Data Science techniques, such as Geoparsing and Topic Modelling, can be used to enhance
scholarly data and provide insights to research questions
Thanks for your attention
martin@electricdata.solutions
Find out more at electricdata.solutions

Más contenido relacionado

Similar a Methods for Scholarly Data Analytics Using APIs and NLP

Zudilova-Seinstra-Elsevier-data and the article of the future-nfdp13
Zudilova-Seinstra-Elsevier-data and the article of the future-nfdp13Zudilova-Seinstra-Elsevier-data and the article of the future-nfdp13
Zudilova-Seinstra-Elsevier-data and the article of the future-nfdp13DataDryad
 
20190527_Karen Hytteballe Ibanez _ The OPERA project
 20190527_Karen Hytteballe Ibanez _ The OPERA project 20190527_Karen Hytteballe Ibanez _ The OPERA project
20190527_Karen Hytteballe Ibanez _ The OPERA projectOpenAIRE
 
HKU Data Curation MLIM7350 Class 9
HKU Data Curation MLIM7350 Class 9 HKU Data Curation MLIM7350 Class 9
HKU Data Curation MLIM7350 Class 9 Scott Edmunds
 
empirical-SLR.pptx
empirical-SLR.pptxempirical-SLR.pptx
empirical-SLR.pptxJitha Kannan
 
Capturing and Analyzing Publication, Citation and Usage Data for Contextual C...
Capturing and Analyzing Publication, Citation and Usage Data for Contextual C...Capturing and Analyzing Publication, Citation and Usage Data for Contextual C...
Capturing and Analyzing Publication, Citation and Usage Data for Contextual C...NASIG
 
NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014
NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014
NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014Susanna-Assunta Sansone
 
Data Sets, Ensemble Cloud Computing, and the University Library: Getting the ...
Data Sets, Ensemble Cloud Computing, and the University Library:Getting the ...Data Sets, Ensemble Cloud Computing, and the University Library:Getting the ...
Data Sets, Ensemble Cloud Computing, and the University Library: Getting the ...SEAD
 
Managing Big Data - Berlin, July 9-10, 201.
Managing Big Data - Berlin, July 9-10, 201.Managing Big Data - Berlin, July 9-10, 201.
Managing Big Data - Berlin, July 9-10, 201.Susanna-Assunta Sansone
 
Research data support: a growth area for academic libraries?
Research data support: a growth area for academic libraries?Research data support: a growth area for academic libraries?
Research data support: a growth area for academic libraries? Robin Rice
 
Systematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping StudiesSystematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping Studiesalessio_ferrari
 
Data sharing as part of the research workflow
Data sharing as part of the research workflowData sharing as part of the research workflow
Data sharing as part of the research workflowVarsha Khodiyar
 
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...AKSHAY BHAGAT
 
Identifiers for Researchers and Data: Increasing Attribution and Discovery– J...
Identifiers for Researchers and Data: Increasing Attribution and Discovery– J...Identifiers for Researchers and Data: Increasing Attribution and Discovery– J...
Identifiers for Researchers and Data: Increasing Attribution and Discovery– J...ALISS
 
Linking Heterogeneous Scholarly Data Sources in an Interoperable Setting: the...
Linking Heterogeneous Scholarly Data Sources in an Interoperable Setting: the...Linking Heterogeneous Scholarly Data Sources in an Interoperable Setting: the...
Linking Heterogeneous Scholarly Data Sources in an Interoperable Setting: the...Platforma Otwartej Nauki
 
Forms of cooperation between National Statistical Institutes and data archive...
Forms of cooperation between National Statistical Institutes and data archive...Forms of cooperation between National Statistical Institutes and data archive...
Forms of cooperation between National Statistical Institutes and data archive...Arhiv družboslovnih podatkov
 
Connecting the pieces: using ORCIDs to improve research impact and repositories
Connecting the pieces: using ORCIDs to improve research impact and repositoriesConnecting the pieces: using ORCIDs to improve research impact and repositories
Connecting the pieces: using ORCIDs to improve research impact and repositoriesORCID, Inc
 
Managing Ireland's Research Data - 3 Research Methods
Managing Ireland's Research Data - 3 Research MethodsManaging Ireland's Research Data - 3 Research Methods
Managing Ireland's Research Data - 3 Research MethodsRebecca Grant
 
Data Management Lab: Session 2 slides
Data Management Lab: Session 2 slidesData Management Lab: Session 2 slides
Data Management Lab: Session 2 slidesIUPUI
 

Similar a Methods for Scholarly Data Analytics Using APIs and NLP (20)

Zudilova-Seinstra-Elsevier-data and the article of the future-nfdp13
Zudilova-Seinstra-Elsevier-data and the article of the future-nfdp13Zudilova-Seinstra-Elsevier-data and the article of the future-nfdp13
Zudilova-Seinstra-Elsevier-data and the article of the future-nfdp13
 
Part 1 Research workshop
Part 1 Research workshopPart 1 Research workshop
Part 1 Research workshop
 
20190527_Karen Hytteballe Ibanez _ The OPERA project
 20190527_Karen Hytteballe Ibanez _ The OPERA project 20190527_Karen Hytteballe Ibanez _ The OPERA project
20190527_Karen Hytteballe Ibanez _ The OPERA project
 
HKU Data Curation MLIM7350 Class 9
HKU Data Curation MLIM7350 Class 9 HKU Data Curation MLIM7350 Class 9
HKU Data Curation MLIM7350 Class 9
 
empirical-SLR.pptx
empirical-SLR.pptxempirical-SLR.pptx
empirical-SLR.pptx
 
Capturing and Analyzing Publication, Citation and Usage Data for Contextual C...
Capturing and Analyzing Publication, Citation and Usage Data for Contextual C...Capturing and Analyzing Publication, Citation and Usage Data for Contextual C...
Capturing and Analyzing Publication, Citation and Usage Data for Contextual C...
 
NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014
NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014
NPG Scientific Data - Metabolomics Society meeting, Tsuruola, Japan, 2014
 
Data Sets, Ensemble Cloud Computing, and the University Library: Getting the ...
Data Sets, Ensemble Cloud Computing, and the University Library:Getting the ...Data Sets, Ensemble Cloud Computing, and the University Library:Getting the ...
Data Sets, Ensemble Cloud Computing, and the University Library: Getting the ...
 
Managing Big Data - Berlin, July 9-10, 201.
Managing Big Data - Berlin, July 9-10, 201.Managing Big Data - Berlin, July 9-10, 201.
Managing Big Data - Berlin, July 9-10, 201.
 
ESRA presentation of ADP SORS cooperation
ESRA presentation of ADP SORS cooperationESRA presentation of ADP SORS cooperation
ESRA presentation of ADP SORS cooperation
 
Research data support: a growth area for academic libraries?
Research data support: a growth area for academic libraries?Research data support: a growth area for academic libraries?
Research data support: a growth area for academic libraries?
 
Systematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping StudiesSystematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping Studies
 
Data sharing as part of the research workflow
Data sharing as part of the research workflowData sharing as part of the research workflow
Data sharing as part of the research workflow
 
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...
 
Identifiers for Researchers and Data: Increasing Attribution and Discovery– J...
Identifiers for Researchers and Data: Increasing Attribution and Discovery– J...Identifiers for Researchers and Data: Increasing Attribution and Discovery– J...
Identifiers for Researchers and Data: Increasing Attribution and Discovery– J...
 
Linking Heterogeneous Scholarly Data Sources in an Interoperable Setting: the...
Linking Heterogeneous Scholarly Data Sources in an Interoperable Setting: the...Linking Heterogeneous Scholarly Data Sources in an Interoperable Setting: the...
Linking Heterogeneous Scholarly Data Sources in an Interoperable Setting: the...
 
Forms of cooperation between National Statistical Institutes and data archive...
Forms of cooperation between National Statistical Institutes and data archive...Forms of cooperation between National Statistical Institutes and data archive...
Forms of cooperation between National Statistical Institutes and data archive...
 
Connecting the pieces: using ORCIDs to improve research impact and repositories
Connecting the pieces: using ORCIDs to improve research impact and repositoriesConnecting the pieces: using ORCIDs to improve research impact and repositories
Connecting the pieces: using ORCIDs to improve research impact and repositories
 
Managing Ireland's Research Data - 3 Research Methods
Managing Ireland's Research Data - 3 Research MethodsManaging Ireland's Research Data - 3 Research Methods
Managing Ireland's Research Data - 3 Research Methods
 
Data Management Lab: Session 2 slides
Data Management Lab: Session 2 slidesData Management Lab: Session 2 slides
Data Management Lab: Session 2 slides
 

Más de National Information Standards Organization (NISO)

Más de National Information Standards Organization (NISO) (20)

Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Bazargan "NISO Webinar, Sustainability in Publishing"
Bazargan "NISO Webinar, Sustainability in Publishing"Bazargan "NISO Webinar, Sustainability in Publishing"
Bazargan "NISO Webinar, Sustainability in Publishing"
 
Rapple "Scholarly Communications and the Sustainable Development Goals"
Rapple "Scholarly Communications and the Sustainable Development Goals"Rapple "Scholarly Communications and the Sustainable Development Goals"
Rapple "Scholarly Communications and the Sustainable Development Goals"
 
Compton "NISO Webinar, Sustainability in Publishing"
Compton "NISO Webinar, Sustainability in Publishing"Compton "NISO Webinar, Sustainability in Publishing"
Compton "NISO Webinar, Sustainability in Publishing"
 
Mattingly "AI & Prompt Design: Large Language Models"
Mattingly "AI & Prompt Design: Large Language Models"Mattingly "AI & Prompt Design: Large Language Models"
Mattingly "AI & Prompt Design: Large Language Models"
 
Hazen, Morse, and Varnum "Spring 2024 ODI Conformance Statement Workshop for ...
Hazen, Morse, and Varnum "Spring 2024 ODI Conformance Statement Workshop for ...Hazen, Morse, and Varnum "Spring 2024 ODI Conformance Statement Workshop for ...
Hazen, Morse, and Varnum "Spring 2024 ODI Conformance Statement Workshop for ...
 
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
Mattingly "AI & Prompt Design" - Introduction to Machine Learning"
 
Mattingly "Text and Data Mining: Building Data Driven Applications"
Mattingly "Text and Data Mining: Building Data Driven Applications"Mattingly "Text and Data Mining: Building Data Driven Applications"
Mattingly "Text and Data Mining: Building Data Driven Applications"
 
Mattingly "Text and Data Mining: Searching Vectors"
Mattingly "Text and Data Mining: Searching Vectors"Mattingly "Text and Data Mining: Searching Vectors"
Mattingly "Text and Data Mining: Searching Vectors"
 
Mattingly "Text Mining Techniques"
Mattingly "Text Mining Techniques"Mattingly "Text Mining Techniques"
Mattingly "Text Mining Techniques"
 
Mattingly "Text Processing for Library Data: Representing Text as Data"
Mattingly "Text Processing for Library Data: Representing Text as Data"Mattingly "Text Processing for Library Data: Representing Text as Data"
Mattingly "Text Processing for Library Data: Representing Text as Data"
 
Carpenter "Designing NISO's New Strategic Plan: 2023-2026"
Carpenter "Designing NISO's New Strategic Plan: 2023-2026"Carpenter "Designing NISO's New Strategic Plan: 2023-2026"
Carpenter "Designing NISO's New Strategic Plan: 2023-2026"
 
Ross and Clark "Strategic Planning"
Ross and Clark "Strategic Planning"Ross and Clark "Strategic Planning"
Ross and Clark "Strategic Planning"
 
Mattingly "Data Mining Techniques: Classification and Clustering"
Mattingly "Data Mining Techniques: Classification and Clustering"Mattingly "Data Mining Techniques: Classification and Clustering"
Mattingly "Data Mining Techniques: Classification and Clustering"
 
Straza "Global collaboration towards equitable and open science: UNESCO Recom...
Straza "Global collaboration towards equitable and open science: UNESCO Recom...Straza "Global collaboration towards equitable and open science: UNESCO Recom...
Straza "Global collaboration towards equitable and open science: UNESCO Recom...
 
Lippincott "Beyond access: Accelerating discovery and increasing trust throug...
Lippincott "Beyond access: Accelerating discovery and increasing trust throug...Lippincott "Beyond access: Accelerating discovery and increasing trust throug...
Lippincott "Beyond access: Accelerating discovery and increasing trust throug...
 
Kriegsman "Integrating Open and Equitable Research into Open Science"
Kriegsman "Integrating Open and Equitable Research into Open Science"Kriegsman "Integrating Open and Equitable Research into Open Science"
Kriegsman "Integrating Open and Equitable Research into Open Science"
 
Mattingly "Ethics and Cleaning Data"
Mattingly "Ethics and Cleaning Data"Mattingly "Ethics and Cleaning Data"
Mattingly "Ethics and Cleaning Data"
 
Mercado-Lara "Open & Equitable Program"
Mercado-Lara "Open & Equitable Program"Mercado-Lara "Open & Equitable Program"
Mercado-Lara "Open & Equitable Program"
 

Último

Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docxPoojaSen20
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 

Último (20)

Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 

Methods for Scholarly Data Analytics Using APIs and NLP

  • 1. Methods and Tools for Scholarly Data Analytics Working with Scholarly APIs: A NISO Training Series Dr Martin Szomszor @martinszomszor June 9th 2022
  • 2. About me • 2003-09 PhD & Research Fellow - University of Southampton • Automatic Data Integration, Semantic Web, Knowledge Engineering, Natural Language Processing • 2009-11 Deputy Head of Centre - City eHealth Research Centre, London • Epidemic intelligence using social media • ECDC Field Epidemiology Manual Wiki • 20011-18 Chief Data Scientist – Digital Science • Global Research Identifier Database (GRID) -> ROR • HEFCE Impact Case Study Database (REF2014) • Springer Nature SciGraph • 2018-21 Director - Institute for Scientific Information, Clarivate • R&D for new Web of Science features including Author Impact Beam Plots, Impact Profiles, Citation Topics and the new Journal Citation Indicator • Produced the annual Highly Cited Researcher list • Various reports and academic papers on research management, bibliometrics, research integrity, emerging research fronts, etc… • 2021-22 Electric Data Solutions LTD • Custom Research Analytics for institutions, funders, publishers and charities • Impact Case Study Database for Hong Kong RAE2020 • Google Scholar link
  • 3. Typical Data Science workflow Formulate Questions Discover data sources • Publication data • Grant award data • Patent data • Policy mentions • Impact Case Studies Link & enhance the data • Text analysis • Bibliometrics • Geoparsing Analytics • Visualizations • Metrics / Indicators • Research Topic • Funders • Collaborations • Disciplines • Journals • Open Access
  • 4. Data Linking • Persistent identifiers provide the best way to link data across different sources: • DOI / ISBN for scholarly works • GRID / ROR for institutions • Crossref Funder Registry for funders • ORCID for researchers • Geonames for geography • Many provide services to help disambiguate unstructured data: • Crossref Simple Text Query will find DOIs for reference text • ROR API to match author affiliations to relevant institutions
  • 5. Data Enhancement • For many analytical scenarios, it is beneficial to integrate data from multiple taxonomies or classifications that provide complimentary views • Research may be classified using the following approaches, each of which will provide a different interpretation • Journal Classifications (Web of Science Subject Categories, Scopus All Science Journal Classification) • Article-level classification (Dimensions Fields of Research, InCites Citation Topics) • Text mining (Topic Modelling, sentiment analysis) • … and many others (OECD Frascati Manual, Funder specific taxonomies) • Choice will depend on the research question • Time-scale (e.g. is long term stability required or are more dynamic and contemporary topics required?) • Granularity (is the classification suitably broad or narrow?) • Coverage (is the data covered sufficiently well?) • Combinations may provide insights into different aspects of the research: • Journal Classifications align well with concept disciplinary origin • Custom topic models can provide more nuanced views of the research type, domain of application or beneficiary group • Bibliometric approaches are useful in the identification of research communities • Other important aspects may include • Collaboration mode (domestic versus international) • Collaborating partners (industrial, healthcare etc) • Open Access type (Gold, Hybrid, Green, etc)
  • 6. Tutorial • To demonstrate these concepts, I have created some Python notebooks to show how data can be extracted, enhanced and analyzed • For this purpose, I have chosen the research topic “Dengue Fever” • Often categorized as a neglected tropical disease, it has growing research interest as cases continue to rise and more countries see infections • Relevant to UN Sustainable Development Goals • Includes research on many topics including climate change, socio-economic development, migration, pharmaceutical development, etc. • It is a research topic where geography is important • It is only a toy example – many crucial steps (i.e. proper search methodology and data curation) have been omitted so the results are only for illustrative purposes
  • 7. Geographical distribution of dengue cases reported worldwide, 2020 https://www.ecdc.europa.eu/en/publications-data/geographical-distribution-dengue-cases-reported-worldwide-2020
  • 8. Technical Requirements • Code for this presentation can be found on Github: • https://github.com/martinszomszor/2022_NISO • I have also included the data files so you can play around with analysis part without reprocessing the source data • If you have Python and pip installed, you can easily install the required libraries using: • pip install –r requirements.txt • You will also need to download and extract the Edinburgh Geoparser from • https://www.ltg.ed.ac.uk/software/geoparser/ • I use Gephi for network visualization, software available for Mac / Windows / Linux • https://gephi.org/
  • 9. Python Jupyter Notebooks • These Notebooks should be executed in order to run the various step of the analysis: • 01 Download Publication Data.ipynb • Use OpenAlex API to extract publications related to “Dengue Fever” • 02 Geoparse.ipynb • Use the OpenSource Edinburgh Geoparser to identify locations mentioned in the title or abstract • 03 Bibliographic Coupling.ipynb • Extract cited references from publication data and use bibliographic coupling to cluster related publications • Vizualise the Network using Gephi • 04 Topic Modelling.ipynb • Create a simple topic model using title and abstracts • Visualize the publication dataset using UMAP • 05 Analysis.ipynb • Join the data files and produce some simple analysis
  • 10. 01 Download Publication Data.ipynb • You may have a subscription to a bibliometric database, such as Web of Science, Scopus, or Dimensions, but if not, there is also the option to use OpenAlex
  • 11. 02 Geoparse.ipynb • Geoparsing – the identification and disambiguation of locations mentioned in text • Used to annotate REF2014 impact case studies with impact locations • When applied to research publications on diseases, it can help identify where studies have taken place
  • 12. 03 Bibliographic Coupling.ipynb • The citation network contains information about the nature of research publications • One common technique is to calculate publication similarity based on the overlap in cited works (bibliographic coupling) • This image is taken from an ISI report on UN SDGs
  • 13. 04 Topic Modelling.ipynb • Topic modelling is a natural language processing (NLP) technique that determines how certain clusters of related words (topics) can be used to categorise the underlying data. • It is a data driven approach, meaning results are not dependent on pre-conceived notions of structure, but are instead derived from the data itself • This image from a 2017 report shows Arts & Humanities research grants and their various topics
  • 14. 05 Analysis.ipynb • How frequently are locations mentioned in research relating to Dengue Fever? • Is research focused in areas with increased incidence? • Do certain locations have more or less research on a particular topic? • Are publications that mention certain countries authored by researches based on those countries?
  • 15. Final remarks • There are many public data sources that can be linked with scholarly data to provide new analytical opportunities • World Bank • World Health Organization • European Union Open Data Portal • UNICEF Datasets • Data Science techniques, such as Geoparsing and Topic Modelling, can be used to enhance scholarly data and provide insights to research questions
  • 16. Thanks for your attention martin@electricdata.solutions Find out more at electricdata.solutions