This document discusses improving scholarly communication through knowledge graphs. It describes current issues with scholarly communication, such as lack of structure, integration, and machine-readability. Knowledge graphs are proposed as a solution for representing scholarly concepts, publications, and data in a structured and linked manner. This would help address issues like reproducibility and duplication, and enable new ways of exploring and querying scholarly knowledge. The document outlines the ScienceGRAPH approach, which uses cognitive knowledge graphs to represent scholarly knowledge at different levels of granularity and to allow intuitive exploration and question answering over semantic representations.
Slide 1
Towards Knowledge Graph based Representation, Augmentation and Exploration of Scholarly Communications
Prof. Dr. Sören Auer
Faculty of Electrical Engineering & Computer Science
Leibniz University of Hannover
TIB – Technische Informationsbibliothek
Slide 2
Zuse Z3: the beginning of computing – close to the hardware
Photo: Konrad Zuse Internet Archiv / Deutsches Museum / DFG
Slide 4
We can make things more intuitive
Picture: The illustrated recipes of Lucy Eldridge
http://thefoxisblack.com/2013/07/18/the-illustrated-recipes-of-lucy-eldridge/
Slide 10
Linked Data Principles
Addressing the neglected third V (Variety)
1. Use URIs to identify the “things” in your data
2. Use http:// URIs so people (and machines) can look them up on the web
3. When a URI is looked up, return a description of the thing in the W3C Resource Description Framework (RDF)
4. Include links to related things
http://www.w3.org/DesignIssues/LinkedData.html
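Principle 2 can be made concrete with a small sketch: a Linked Data client asks an http:// URI for an RDF serialisation via HTTP content negotiation. The request below is only constructed, not sent, so the example needs no network access; the Turtle media type and the DBpedia URI for Hannover are well-known, but the snippet is an illustration rather than a full Linked Data client.

```python
# Sketch of Linked Data principle 2: dereferenceable http:// URIs.
# A client signals via the Accept header that it wants RDF (Turtle)
# rather than an HTML page. The request is built but never sent.
from urllib.request import Request

req = Request("http://dbpedia.org/resource/Hannover",
              headers={"Accept": "text/turtle"})

print(req.full_url)              # the "thing" being looked up
print(req.get_header("Accept"))  # the requested RDF serialisation
```

Sending this request to a Linked Data server would typically return a 303 redirect to an RDF document describing the resource.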
Slide 11
RDF & Linked Data in a Nutshell
1. Graph-based RDF data model consisting of S-P-O (Subject – Predicate – Object) statements (facts)
2. Serialised as RDF triples:
Et-Inf conf:organizes Antrittsvorlesung2019 .
Antrittsvorlesung2019 conf:starts "2019-05-20"^^xsd:date .
Antrittsvorlesung2019 conf:takesPlaceAt dbpedia:Hannover .
3. Publication under a URL on the Web, intranet or extranet
[Figure: the three triples drawn as a graph – nodes Et-Inf, Antrittsvorlesung2019, dbpedia:Hannover and the date 20.05.2019, connected by the edges conf:organizes, conf:starts and conf:takesPlaceAt, each read as Subject – Predicate – Object]
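The S-P-O model above can be sketched in a few lines of plain Python, using tuples as statements; in practice an RDF library such as rdflib would be the idiomatic choice, and the prefixed names (conf:, dbpedia:) are kept as plain strings here rather than expanded to full URIs.

```python
# A minimal sketch of the S-P-O data model: each fact is one
# (subject, predicate, object) tuple, mirroring the triples above.
triples = {
    ("Et-Inf", "conf:organizes", "Antrittsvorlesung2019"),
    ("Antrittsvorlesung2019", "conf:starts", "2019-05-20"),
    ("Antrittsvorlesung2019", "conf:takesPlaceAt", "dbpedia:Hannover"),
}

def objects(subject, predicate):
    """All objects of statements with the given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("Antrittsvorlesung2019", "conf:takesPlaceAt"))
```

Because facts are uniform triples, adding knowledge is just adding tuples – no schema change is needed, which is what makes the model suitable for incrementally growing graphs.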
Slide 12
Creating Knowledge Graphs with RDF Linked Data
[Figure: two entity descriptions being integrated – the company DHL (full name DHL International GmbH, industry Logistics, labels “DHL” / “Logistik” / “物流”, headquarters Bonn) and the building Post Tower (height 162.5 m, located in Bonn) – merged into a single graph via shared identifiers and properties such as label, industry, headquarters, height and located in]
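The integration shown in the figure can be sketched as a merge of property–value sets keyed on a shared identifier: when two sources describe the same entity under one URI, the combined description is simply the union of their statements. The URI and property names below are illustrative, not actual DBpedia terms.

```python
# Sketch: two sources describing the DHL entity under a shared
# (hypothetical) URI are merged into one multilingual description.
from collections import defaultdict

source_a = {"http://example.org/DHL": {
    "label": {"DHL"}, "industry": {"Logistics"}, "headquarters": {"Bonn"}}}
source_b = {"http://example.org/DHL": {
    "label": {"Logistik", "物流"}, "full name": {"DHL International GmbH"}}}

merged = defaultdict(lambda: defaultdict(set))
for source in (source_a, source_b):
    for uri, props in source.items():
        for prop, values in props.items():
            merged[uri][prop] |= values  # union of property values

print(sorted(merged["http://example.org/DHL"]["label"]))
```

The shared URI is what makes the merge trivial – without agreed identifiers, the same integration would require error-prone entity matching.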
Slide 13
Knowledge Graphs – A Definition
Fabric of concepts, classes, properties, relationships and entity descriptions
Uses a knowledge representation formalism (RDF, OWL)
Holistic knowledge (multi-domain, multi-source, multi-granularity):
- instance data (ground truth),
- open (e.g. DBpedia, Wikidata), private (e.g. supply chain data) and closed data (product models),
- derived, aggregated data,
- schema data (vocabularies, ontologies),
- metadata (e.g. provenance, versioning, documentation, licensing),
- comprehensive taxonomies to categorize entities,
- links between internal and external data,
- mappings to data stored in other systems and databases
Slide 15
WDAqua project vision
● Answer natural language questions
● Exploit knowledge encoded in the Web of Data
● Provide QA services to citizens, communities, and industry
[Figure: question (Q) answered (A) over the Web of Data]
Slides 17–21
Who is the director of A Clockwork Orange?
(Pipeline built up step by step:)
Understand a spoken question
Analyse the question
Find data to answer the question
Present the answer
Data source: [logo]
Slides 22–23
Which publications and health reports are related to Alzheimer’s in Greece?
Understand a spoken question
Analyse the question
Find data to answer the question
Present the answer
Data sources: [logos]
Slide 24
WDAqua QA architecture
QA pipeline: voice to text, disambiguator, NL to SPARQL, relation extraction, answer generation, UI
Service infrastructure: QA pipeline configurator, service repository, monitoring, RESTful API, versioning, message dispatcher
Data management layer: query decomposition, data source selection, query execution, benchmarking, profiling, data quality, data generation
Data layer
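The architecture's central idea – independent QA components chained into a configurable pipeline – can be sketched as function composition. The component names mirror the slide, but the composition mechanism and the stub implementations are illustrative, not the actual WDAqua/Qanary code.

```python
# Sketch: QA components as functions over a shared state dict,
# composed into a pipeline. Real components (speech recognition,
# NL-to-SPARQL translation) are replaced by stand-in stubs here.
def voice_to_text(state):
    state["text"] = state.pop("audio")  # stand-in for speech recognition
    return state

def nl_to_sparql(state):
    # stand-in: a real component would translate the question to SPARQL
    state["query"] = f"# SPARQL for: {state['text']}"
    return state

def compose(*components):
    """Chain components so each one's output state feeds the next."""
    def pipeline(state):
        for component in components:
            state = component(state)
        return state
    return pipeline

qa = compose(voice_to_text, nl_to_sparql)
result = qa({"audio": "Who is the director of A Clockwork Orange?"})
print(result["query"])
```

Because components only agree on the shared state, a pipeline configurator can swap or reorder them – the motivation behind the service repository and pipeline configurator boxes in the architecture.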
Slide 25
Who is the director of A Clockwork Orange?
Understand a spoken question → Analyse the question → Find data to answer the question → Present the answer
Demo: http://wdaqua.eu/qa
Slide 26
How did information flows change in the digital era?
Slide 34
The World of Publishing & Communication has profoundly changed
New means adapted to the new possibilities were developed, e.g. “zooming”, dynamics
Business models changed completely
More focus on data, interlinking of data / services, and search in the data
Integration and crowdsourcing play an important role
Slide 39
Scientific publishing today
We have [example paper] – but it:
- is mainly based on PDF
- is only partially machine-readable
- does not preserve structure
- does not allow embedding of semantics
- does not facilitate interactivity / dynamicity / repurposing
- …
Source: https://www.researchgate.net/publication/264412537_AGDISTIS_-_Graph-Based_Disambiguation_of_Named_Entities_using_Linked_Data
Slide 40
Scholarly communication has not changed (much)
[Timeline: 17th century – 19th century – 20th century – 21st century]
Meanwhile, other information-intense domains were completely disrupted: mail order catalogs, street maps, phone books, …
Slide 41
Challenges we are facing: we need to rethink the way research is represented and communicated
- Digitalisation of Science: data integration and analysis, digital collaboration
- Monopolisation by commercial actors: publisher lock-in effects, maximization of profits [1]
- Reproducibility Crisis: the majority of experiments are hard or impossible to reproduce [2]
- Proliferation of publications: publication output doubled within a decade and continues to rise [3]
- Deficiency of Peer Review: deteriorating quality [4], predatory publishing
[1] http://thecostofknowledge.com, https://www.projekt-deal.de
[2] M. Baker: 1,500 scientists lift the lid on reproducibility. Nature, 2016.
[3] Science and Engineering Publication Output Trends. National Science Foundation, 2018.
[4] J. Couzin-Frankel: Secretive and Subjective, Peer Review Proves Resistant to Study. Science, 2013.
Slide 42
Proliferation of scientific literature
[Chart: science and engineering articles by region and country, 2004 and 2014]
Source: National Science Foundation: Science and Engineering Publication Output Trends: https://www.nsf.gov/statistics/2018/nsf18300/nsf18300.pdf
Slide 44
Duplication and Inefficiency
How can we avoid duplication if the terminology, research problems, approaches, methods, characteristics, evaluations, … are not properly defined and identified?
How would you build an engine or a building without properly defining its parts, relationships, materials, characteristics, …?
Source: https://thumbs.worthpoint.com/zoom/images2/1/0316/22/revell-4-visible-8-engine-plastic_1_d2162f52c3fa3a6f72d2722f6c50b7b2.jpg
Source: http://xnewlook.com/cad-and-revit-3d-design.html/bill-ferguson-portfolio-computer-graphics-games-cad-related-3d-models-cad-and-revit-design
Slide 45
Root Cause – Deficiency of Scholarly Communication?
Lack of…
- Transparency: information is hidden in text
- Integratability: fitting different research results together
- Machine assistance: unstructured content is hard to process
- Identifiability: of concepts beyond metadata
- Collaboration: the one-brain barrier
- Overview: scientists look for the needle in the haystack
Slide 47
How good is CRISPR (w.r.t. precision, safety, cost)?
What is specific about genome editing in insects?
Who has applied it to butterflies?
Search for CRISPR: more than 163,000 results
Source: https://scholar.google.de/scholar?hl=de&as_sdt=0%2C5&q=CRISPR&btnG=, 04.2019
Slide 52
Chemistry Example: Populating the Graph
1. Original publication: CRISPR/Cas9 editing in Lepidoptera, https://doi.org/10.1101/130344
2. Adaptive graph curation & completion:
Author: Robert Reed
Research problem: Genome editing in Lepidoptera
Methods: CRISPR/Cas9
Applied on: Lepidoptera
Experimental data: https://doi.org/10.5281/zenodo.896916
3. Graph representation:
[Figure: the paper (https://doi.org/10.1101/130344) by Robert Reed (https://orcid.org/0000-0002-6065-6728) addresses the research problem Genome editing in Lepidoptera, linked to the broader concept Genome editing (https://www.wikidata.org/wiki/Q24630389); it isEvaluatedWith CRISPR/Cas9 and the experimental data https://doi.org/10.5281/zenodo.896916]
Slide 53
Cognitive Knowledge Graphs for scholarly knowledge
KGs are proven to capture factual knowledge [1]
Research challenge: manage
• uncertainty & disagreement
• varying semantic granularity
• emergence, evolution & provenance
• integrating existing domain models
…but maintain flexibility and simplicity
ScienceGRAPH approach: Cognitive Knowledge Graphs
• fabric of knowledge molecules – compact, relatively simple, structured units of knowledge
• can be incrementally enriched, annotated, interlinked, …
[1] S. Auer et al.: DBpedia: A nucleus for a web of open data. 6th Int. Semantic Web Conf. (ISWC) – 10-year best paper award.
cf. also knowledge graphs from: Wikidata, BBC, Google, Bing, Thomson Reuters, Airbnb, BNY Mellon, …
Slides 54–55
From Factual to Cognitive Knowledge Graphs

                Factual (today)              Cognitive (ScienceGRAPH)
Base entities   Real world                   Conceptual
Granularity     Atomic entities              Interlinked descriptions (molecules) with annotations (provenance)
Evolution       Addition/deletion of facts   Concept drift, varying aggregation levels
Collaboration   Fact enrichment              Emergent semantics
Slide 56
Exploration and Question Answering
Research challenge:
• Intuitive exploration leveraging the rich semantic representations
• Answer natural language questions
ScienceGRAPH approach:
• KG-based QA component integration for dynamic and automated composition of QA pipelines for cognitive knowledge graphs (e.g. following [1])
• Round-trip refinement and integration of search, faceted exploration, question answering and conversational interfaces
Pipeline: question parsing → named entity recognition (NER) & linking (NEL) → relation extraction → query construction → query execution → result rendering
Q: How do different genome editing techniques compare?
SELECT ?approach ?feature WHERE {
  ?approach addresses GenomeEditing .
  ?approach hasFeature ?feature }
[1] K. Singh, S. Auer et al.: Why Reinvent the Wheel? Let's Build Question Answering Systems Together. The Web Conference (WWW 2018).
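What executing such a query means can be sketched over a tiny in-memory graph: collect every binding of the two query variables against the triple patterns. The feature values below are made-up illustrations loosely inspired by the comparison table on the next slide, not data from the actual knowledge graph.

```python
# Sketch: evaluating the two triple patterns of the query above
# (?approach addresses GenomeEditing . ?approach hasFeature ?feature)
# against a hand-made toy graph of (subject, predicate, object) tuples.
graph = [
    ("CRISPR/Cas9", "addresses", "GenomeEditing"),
    ("CRISPR/Cas9", "hasFeature", "easy to engineer"),
    ("ZFN", "addresses", "GenomeEditing"),
    ("ZFN", "hasFeature", "costly screening"),
]

# Pattern 1: which subjects address GenomeEditing?
approaches = {s for s, p, o in graph
              if p == "addresses" and o == "GenomeEditing"}

# Pattern 2: join on the shared ?approach variable to collect features.
bindings = [(s, o) for s, p, o in graph
            if p == "hasFeature" and s in approaches]
print(sorted(bindings))
```

A real SPARQL engine generalizes exactly this: each triple pattern filters the graph, and shared variables induce joins between the patterns' solution sets.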
Slide 57
Result: Automatic Generation of Comparisons / Surveys
Q: How do different genome editing techniques compare?

Engineered nucleases – site-specificity / safety / ease-of-use, costs, speed:
- zinc finger nucleases (ZFN): ++ (9–18 nt) / + / -- ($$$: screening and testing to determine efficiency)
- transcription activator-like effector nucleases (TALENs): +++ (9–16 nt) / ++ / ++ (easy to engineer; 1 week / a few hundred dollars)
- engineered meganucleases: +++ (12–40 nt) / 0 / -- ($$$: protein engineering, high-throughput screening)
- CRISPR system / Cas9: ++ (5–12 nt) / - / +++ (easy to engineer; a few days / less than 200 dollars)
Slide 87
[Figure: annotated screenshot labelling the paper, the authors, the research contribution, the research result and a continuous variable value]
Slide 88
More projects – stay tuned
https://tib.eu
Mailing list / group: https://groups.google.com/forum/#!forum/orkg
Open Research Knowledge Graph: https://orkg.org
ERC Consolidator Grant ScienceGRAPH started in May
Transfer event on International Data Space on June 19: https://events.tib.eu/transfer/
Slide 89
The Team – TIB/L3S Scientific Data Management
Group leader: Prof. (Univ. S. Bolívar) Dr. Maria Esther Vidal
PostDocs: Dr. Markus Stocker, Dr. Gábor Kismihók, Dr. Javad Chamanara, Dr. Jennifer D’Souza
Doctoral researchers: Olga Lezhnina, Allard Oelen, Yaser Jaradeh, Shereif Eid
Software development: Kemele Endris, Farah Karim, Manuel Prinz
Project management: Alex Garatzogianni, Laura Granzow
Collaborators InfAI Leipzig / AKSW: Dr. Michael Martin, Natanael Arndt
Collaborators: Sarven Capadisli, Vitalis Wiens, Wazed Ali
Slide 90
Contact
Prof. Dr. Sören Auer
TIB & Leibniz University of Hannover
auer@tib.eu
https://de.linkedin.com/in/soerenauer
https://twitter.com/soerenauer
https://www.xing.com/profile/Soeren_Auer
http://www.researchgate.net/profile/Soeren_Auer
Editor's notes
The Z3, built in 1941 by Konrad Zuse in collaboration with Helmut Schreyer in Berlin, was the first operational digital computer in the world. It was implemented in electromagnetic relay technology, with 600 relays for the arithmetic unit and 1,400 relays for the memory unit.
Longquan stoneware incense burner, China, 12th-13th century AD. Part of the Percival David Collection of Chinese Ceramics.
Kemele M. Endris, Mikhail Galkin, Ioanna Lytra, Mohamed Nadjib Mami, Maria-Esther Vidal, Sören Auer: MULDER: Querying the Linked Data Web by Bridging RDF Molecule Templates. DEXA (1) 2017: 3-18
D. Diefenbach, K. Singh, A. Both, D. Cherix, C. Lange, S. Auer. 2017. The Qanary Ecosystem: Getting New Insights by Composing Question Answering Pipelines. Int. Conf. on Web Engineering ICWE 2017.
K. Singh, A. Sethupat, A. Both, S. Shekarpour, I. Lytra, R. Usbeck, A. Vyas, A. Khikmatullaev, D. Punjani, C. Lange, M.-E. Vidal, J. Lehmann, S. Auer: Why Reinvent the Wheel? Let's Build Question Answering Systems Together. The Web Conference (WWW 2018).
S. Shekarpour, E. Marx, S. Auer, A. P. Sheth: RQUERY: Rewriting Natural Language Queries on Knowledge Graphs to Alleviate the Vocabulary Mismatch Problem. AAAI 2017: 3936-3943
D. Lukovnikov, A. Fischer, J. Lehmann, S. Auer: Neural Network-based Question Answering over Knowledge Graphs on Word and Character Level. WWW 2017: 1211-1220
We reproduce a statistical hypothesis test published as a result in this paper, namely https://doi.org/10.1093/eurheartj/ehw333
We represent this result in a machine readable form following the concept description for a kind of statistical hypothesis test of the statistical methods ontology (STATO), namely http://purl.obolibrary.org/obo/STATO_0000304
We store the representation in the ORKG database
Specifically, we replicate, describe in machine readable form, and store in the ORKG database the statistical hypothesis test result highlighted and shown here in human readable form
Note that the relevant information is presented in multiple modalities, both text and images, and none of them is easily read and interpreted by machines
In particular, the relevant data is presented as plot in Figure 1 B (image)
Furthermore, the kind of statistical hypothesis test performed, the fact that IRE binding activity is the dependent variable, and the p-value are all implicit information.
We conduct the statistical hypothesis test in Jupyter using Python
We have IRE binding activity data for two groups, called non-failing heart (i.e., healthy individuals) and failing heart (i.e., patients)
We compute a t-test and obtain a p-value
This is the classical workflow a typical researcher would do using SPSS or similar statistical computing environment
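The two-group comparison described above can be sketched with a Welch t statistic over the two samples. The data values below are made up purely for illustration – the real IRE binding activity measurements live behind the Zenodo/ORKG records – and in a Jupyter workflow one would typically call scipy.stats.ttest_ind(a, b, equal_var=False) to also obtain the p-value.

```python
# Sketch: Welch's t statistic for two independent samples,
# using only the standard library. Sample values are illustrative.
from statistics import mean, variance

non_failing = [105.0, 98.2, 110.4, 101.7]  # illustrative values only
failing = [72.3, 80.1, 69.5, 75.8]         # illustrative values only

def welch_t(a, b):
    """Welch's t statistic (unequal variances) for two samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    return (mean(a) - mean(b)) / (va + vb) ** 0.5

t = welch_t(non_failing, failing)
print(round(t, 2))  # a large |t| corresponds to a small p-value
```

The ORKG description then captures exactly these ingredients in machine-readable form: the two input samples, the kind of test, the dependent variable, and the resulting p-value.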
However, in contrast to the classical workflow, here we represent and store a machine readable description of the statistical hypothesis test (one that includes the input data, the output p-value, the dependent variable, and the kind of statistical hypothesis test used) in the ORKG
Later, when the paper and its results are published we will be able to relate to this result in the overall research contribution description
Let’s look at how this is done using the ORKG User Interface
We add a paper by DOI lookup or alternatively manually (e.g., if a DOI is not available)
The bibliographic metadata about the paper (title, authors, etc.) is automatically fetched from Crossref and displayed in the user interface
Users can then classify the paper according to research field
More interesting is the possibility to describe the research contributions this paper makes
First, researchers can provide a research problem description
Next, the researcher can further describe the contribution
Here we show how to link to the statistical hypothesis test result obtained earlier in data analysis and published in the paper as a result of this research contribution
We say that research contributions “yield” research results; hence, the “Yields” attribute shown here
The machine readable result has a human readable label which is shown in the user interface by simply typing some included words, here “IRE”
The user can select the correct result and save it
The research contribution can be further described, e.g. the approach used
The paper may make further contributions, which can be described as well
For the purpose here, we skip this and move to the next step
That’s it, the paper description, its research contribution, addressed problem and one result are added
Note that we did not describe the research result, the statistical hypothesis test conducted earlier. We just linked to it!
Let’s look at the paper
The paper can be browsed by research field and is shown as recently added
It can be selected here
Here we see the details, in addition to bibliographic metadata the research contributions of this paper
For the research contribution we just described, we see the problem and we can now inspect the yielded research result
In addition to a human readable label, the statistical hypothesis test description has a specific type, has three inputs and an output, namely the p-value
Let’s look at the output
The output is indeed typed as a p-value, a concept of the Ontology for Biomedical Investigations (i.e., http://purl.obolibrary.org/obo/OBI_0000175)
It has a value specification, namely the specific value computed earlier in data analysis
We can take a look at the value by expanding the value specification
Here it is, the specified numeric value of the computed p-value typed as a scalar value specification, another term of the Ontology for Biomedical Investigations (i.e., http://purl.obolibrary.org/obo/OBI_0001931)
Now, let’s go back to our research result description and take a look at the three specified inputs of the statistical hypothesis test
Here they are: The study design dependent variable and the two continuous variables for failing and non-failing hearts
Let’s take a quick look at what was the study design dependent variable
Here, it was Iron-Responsive Element (IRE) binding
Which is a term of the Gene Ontology (i.e., http://purl.obolibrary.org/obo/GO_0030350)
Finally, let’s take a look at the non-failing heart continuous variable used as specified input in the statistical hypothesis test
Each continuous variable (a term of the statistical methods ontology) has parts, namely scalar measurement data
These are the actual data values and we can explore them
Here is an example, the numeric value 105.0 of the value specification #4 in the non-failing heart continuous variable