Research Knowledge Graphs (RKGs) can help address challenges in data science like reproducibility and bias by making relationships between scientific resources like data, publications, and methods explicit and machine-interpretable. GESIS is constructing large-scale RKGs using natural language processing and deep learning methods to extract knowledge graphs about software and data usage from millions of publications. These RKGs power semantic search and enable new social science research using datasets like TweetsKB, which contains over 10 billion annotated tweets. The NFDI4DS aims to build a joint RKG by connecting existing RKGs through common standards and identifiers.
2. NFDI4DS - Consortium
2
FHI
Universität
Leipzig
Several universities as centers of
excellence in DS and AI.
Several non-university research
institutes as key contributors.
Scientific information centers
and infrastructure providers as
providers of data and for access to
domain user groups.
3. ● Paradigm shift in computer science towards data- and deep
learning driven methods (DS = fourth scientific paradigm).
● Reproducibility crisis.
○ Study on neural recommender publications at top-tier-
venues finds only 7/18 reproducible and only 1/18 beating
state-of-the-art. [Dacrema et al., RecSys2019]
● Bias and discrimination from learned data-induced biases.
○ Gender classification performance of neural approaches
performs significantly worse for darker skinned women.
[Buolamwini et al., FAccT2018]
○ Health risk of black people consistently underestimated
by predictive algorithms. [Obermeyer et al., Science2019]
Data Science & AI Challenges
3
4. ▪ Research Data
▪ Publications
▪ Code/Scripts
▪ ML Models
▪ Methods
▪ Claims
▪ Metrics
Relations between scientific resources, data, knowledge Research Data Cycle
Provenance & Dependencies of Research Data, Resources, Knowledge
5. Common questions for researchers
• Which top-tier publications cite which data/method?
(„dataset authority“)
• Which data was used to train/evaluate which method?
Which method to produce what data?
• Which claims are supported/cited/rejected by what
dataset or publication?
▪ Research Data
▪ Publications
▪ Code/Scripts
▪ ML Models
▪ Methods
▪ Claims
▪ Metrics
Relations between scientific resources, data, knowledge
Provenance & Dependencies of Research Data, Resources, Knowledge
6. Challenges
• Data & metadata about resources and concepts not
represented in structured, machine-interpretable,
integrated manner (hidden in publications, web pages
etc)
• Persistent identifiers (e.g. DOIs) used inconsistently
(e.g. on publications/datasets)
• Relations and semantics not explicit
• Reproducibility crisis in CS/DS/AI
▪ Research Data
▪ Publications
▪ Code/Scripts
▪ ML Models
▪ Methods
▪ Claims
▪ Metrics
Relations between scientific resources, data, knowledge
Provenance & Dependencies of Research Data, Resources, Knowledge
7. • Data interoperability and reuse through established W3C standards for data
sharing (on the Web), e.g. RDF, JSON, shared vocabularies
(e.g. schema.org, DCAT, DDI), APIs for data reuse and linking
• Making links between resources and concepts explicit & machine-
interpretable
(e.g. which publications cite what dataset?)
• Consistent use of persisent IDs (e.g. URIs, DOIs) across all data, e.g.
concepts, resources etc („DOIs for all“)
• Partners in NFDI4DS/TA2 (“RKGs”)
Knowledge Graphs for FAIR Research Data in NFDI4DS
Resources
▪ Datasets
▪ Publications
▪ Code
▪ Software
Concepts
▪ Terms &
Definitions
▪ Claims
▪ Methods
▪ Topics
▪ Entities
8. GESIS & TIB RKGs in NFDI4DS
10
• Community
• Expert-curation/annotation
workflow/tools
• Focus point & hub
• PIDs
https://www.nfdi4datascience.de/
• Large-scale knowledge
graphs
• Automated deep learning-
based methods for
extracting KGs
• Populating ORKG
Resources
▪ Datasets
▪ Publications
▪ Code
▪ Software
Concepts
▪ Terms &
Definitions
▪ Claims
▪ Methods
▪ Topics
▪ Entities
9. 11
Research KGs in Practice: integrated search @ GESIS
https://search.gesis.org/
Dataset
Rel. Publications
10. 12
From publications to machine-interpretable metadata KGs
Disambiguation of dataset & software/script citations
https://data.gesis.org/softwarekg
▪ Training deep learning-
based model for extraction
software & data references
in large-scale data
(3.5 M publications)
▪ Data lifting into KG
▪ 300+ M triples / statements
▪ Search across
data/software/publications
(GESIS Search)
11. From publications to machine-interpretable metadata KGs
Understanding scientific software/data usage
https://data.gesis.org/softwarekg
(Schindler et al., CIKM2021)
▪ Understanding SW
usage, citation habits
and their evolution
across disciplines
▪ Rise of data science =
rise of software usage
13. https://data.gesis.org/softwarekg
▪ Top adopters of data
science/AI/software…
▪ …follow the worst
citation habits
From publications to machine-interpretable metadata KGs
Understanding scientific software/data usage
15. From social media to machine-interpretable research data KGs
https://data.gesis.org/tweetskb
TweetsKB – a large-scale research KG of societal opinions
▪ Harvesting & archiving of 10 Billion tweets
(permanent collection from Twitter 1% sample since
2013)
▪ Information extraction pipeline to build a KG of
entities, interactions & sentiments
(distributed batch processing via Hadoop
Map/Reduce)
o Entity linking with knowledge graph/DBpedia
(“president”/“potus”/”trump” =>
dbp:DonaldTrump)
o Sentiment analysis/annotation
o Geotagging
o Lifting into knowledge graph schema
▪ Public, privacy-aware, large-scale research corpus
of public opinions and their evolution
=> interdisciplinary research
P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public
and Large-Scale RDF Corpus of Annotated Tweets, ESWC'18.
16. RKG-based social science research using TweetsKB
https://dd4p.gesis.org
Investigating Vaccine Hesitancy in DACH countries
17. Germany suspends
vaccinations with Astra
Zeneca
Twitter discourse zu “Impfbereitschaft”
RKG-based social science research using TweetsKB
Investigating Vaccine Hesitancy in DACH countries
https://dd4p.gesis.org
18. RKG-based discourse analysis using TweetsKB
Vaccine Hesitancy– key topics in “safety” category
„Schwangerschaft“ „Kimmich“
„Alter“
„Nebenwirkungen“
„Herzinfarkt“
„Zulassung“
https://dd4p.gesis.org
19. Summary: Research KGs @ GESIS
21
Tools for constructing scholarly knowledge graphs
● NLP and deep learning-powered methods for extracting large-scale KGs
about methods, claims, data, software involved in the scientific process
Large-scale scholarly KGs, e.g.
● KGs about scholarly use of software & research data
(e.g. SoftwareKG: 1.8 M disambiguated software mentions extracted
from 3 M publications, https://data.gesis.org/softwarekg/)
● Web mined KGs of social science research data, e.g. public opinions,
claims and attitudes expressed on social media
(e.g. TweetsKB: > 10 Bn semantically annotated tweets, sentiments,
https://data.gesis.org/tweetskb)
Semantic Search powered by KGs and related tools
● RKG-powered search across scholarly publications, datasets, methods
and their relations (e.g. GESIS Search, https://search.gesis.org)
https://gesis.org/en/kts
https://search.gesis.org
20. Outlook: a joint Research KG in NFDI4DS
22
Resources
▪ Datasets
▪ Publications
▪ Code
▪ Software
Concepts
▪ Terms &
Definitions
▪ Claims
▪ Methods
▪ Topics
▪ Entities
• Community
• Expert-curation/annotation
workflow/tools
• Focus point & hub
• PIDs
https://www.nfdi4datascience.de/
• Large-scale knowledge
graphs
• Automated deep learning-
based methods for
extracting KGs
• Populating ORKG