1. LinkML
Linked (Open) Data Modeling Language
Yosemite Presentation April 2021
Harold Solbrig
Chris Mungall
These slides:
https://tinyurl.com/linkml-2021-april
5. “For the semantic web to function, computers must have access to structured collections of information and sets of inference rules that they can use to conduct automated reasoning.”
“Traditional knowledge-representation systems typically have been centralized, requiring everyone to share exactly the same definition of common concepts such as "parent" or "vehicle." But central control is stifling, and increasing the size and scope of such a system rapidly becomes unmanageable.”
6. Vision of the Semantic Web: information → meaning
“The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”
Prose for humans → RDF for machines
Centralized data repositories → Decentralized information networks
Free text → Ontologies
Manual extraction / data wrangling → Automatic Agents
Unsigned → Digital Signatures
Identify with strings → Identify with resolvable http URIs
7. The Semantic Web 20 Years Later
Progress
- Web is ubiquitous
- URIs are used
- Agents abound
- Digital signatures and security have advanced
- Semantics are improving (schema.org)
(Not so much) progress
- Decentralization -- Web is decentralized, but aggregators dominate (Solid project)
- Semantics -- ontologies abound, but useful ontologies… not so much.
- RDF -- still an afterthought. Informal models (JSON) or formal schemas, but semantics are still largely textual
8. Biolink Model
What led to LinkML
The charge from NCATS:
● Create a Knowledge Graph Schema
● Encompass all biology from molecules through to clinical entities
● Get 20 different sites using the same data model
○ (oh: Only a handful of which use RDF/OWL)
● Do it quickly and break new ground in Translational Science
9. National Microbiome Data Collaborative
Goal
● Make multi-omics microbiome data FAIR
○ Environments
○ Metagenomes
○ Metatranscriptomes
○ Metabolomics
○ Metaproteomics
● Leverage existing ontologies and standards
● Enable discovery in microbiome science
11. LinkML Philosophy
● Simplicity: YAML source files managed in GitHub
● Multimodal
○ JSON, RDF, Property Graphs
○ Open and Closed World use cases
● Stealth Semantics
○ Let them have JSON and OO Python Data Classes
○ Shh, secretly it’s JSON-LD
● Be a parasite
○ Compiles down to other frameworks; we can then leverage their toolchains
■ JSON-Schema: validation of JSON
■ ShEx: validation of RDF graphs
■ GraphQL: APIs
■ OWL: reasoning, browsers/registries
■ JSON-LD Contexts
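The "compiles down to other frameworks" idea can be sketched with a toy compiler. This is not the LinkML JSON-Schema generator: the `model` dictionary is an assumed, simplified stand-in for a LinkML schema, and `to_json_schema` and `TYPE_MAP` are hypothetical names used only for illustration.

```python
# Toy illustration only: NOT the LinkML JSON-Schema generator.
# The model layout below is an assumed, simplified subset of a LinkML schema.
import json

# Assumed mapping from slot ranges to JSON Schema primitive types.
TYPE_MAP = {"string": "string", "integer": "integer", "boolean": "boolean"}


def to_json_schema(model: dict, class_name: str) -> dict:
    """Compile one class of the toy model into a JSON Schema object."""
    cls = model["classes"][class_name]
    properties, required = {}, []
    for slot_name in cls["slots"]:
        slot = model["slots"][slot_name]
        item = {"type": TYPE_MAP[slot.get("range", "string")]}
        if slot.get("multivalued"):
            # Multivalued slots become arrays of the base type.
            item = {"type": "array", "items": item}
        properties[slot_name] = item
        if slot.get("required"):
            required.append(slot_name)
    return {
        "title": class_name,
        "type": "object",
        "properties": properties,
        "required": required,
        "additionalProperties": False,  # closed-world style validation
    }


model = {
    "slots": {
        "id": {"range": "string", "required": True},
        "first_name": {"range": "string", "multivalued": True, "required": True},
        "last_name": {"range": "string", "required": True},
    },
    "classes": {"Person": {"slots": ["id", "first_name", "last_name"]}},
}

person_schema = to_json_schema(model, "Person")
print(json.dumps(person_schema, indent=2))
```

Once a model is expressed in the target framework's own format, any existing validator for that framework can be reused, which is the point of the "parasite" strategy.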
14. LinkML “Goals”
Distributed, federated models
● Easy to create and maintain
● Available in multiple forms
● URL Addressable
● Integrated with GitHub idiom
Automatic tool generation
● Loaders / dumpers
● Format transformations
Baked in semantics
● Everything gets a URL
● Baked in RDF and Semantic links
○ Invisible except when necessary
● Semantic driven model transformation via RDF
○ JSON-LD and ShEx under the covers
○ JSON / YAML / CSV on the surface
16. The Yosemite Vision of Data Translation
Figure (steps labeled 1, 2, 3): Source → Target. Translate based on crowdsourced rules.
Adapted from Graphic by David Booth
17. ...was not without its problems
Figure (steps labeled 1, 2, 3): Source → Target. Translate based on crowdsourced rules.
Adapted from Graphic by David Booth
- Source doesn’t include formal (RDF) semantics. 3rd parties must create, validate and maintain these semantics
- RDF doesn’t lend itself to crowdsourcing
- Structural and semantic differences mean that both the source and target need to support not just semantics but shared semantics.
18. LinkML: Embed RDF semantics directly in the source and target models; augment the translation process with ontology and reasoners.
Figure labels: LinkML source model; LinkML target model; Ontology; Reasoners
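A minimal sketch of why embedded semantics help translation, assuming two models that attach the same schema.org URIs to differently named fields. The field names, the two model dictionaries, and the `translate` helper are hypothetical and are not part of LinkML.

```python
# Hypothetical sketch: the models, field names, and translate() helper are
# illustrative assumptions, not LinkML APIs.
SDO = "https://schema.org/"

# Each model maps its local field names to the URI that defines their meaning.
source_model = {"first": SDO + "givenName", "surname": SDO + "familyName"}
target_model = {"given_name": SDO + "givenName", "family_name": SDO + "familyName"}


def translate(record: dict, source: dict, target: dict) -> dict:
    """Rename fields by matching the URIs that the two models share."""
    uri_to_target = {uri: name for name, uri in target.items()}
    out = {}
    for field_name, value in record.items():
        uri = source.get(field_name)
        if uri in uri_to_target:
            out[uri_to_target[uri]] = value
    return out


translated = translate({"first": "Jill", "surname": "Jones"}, source_model, target_model)
print(translated)
```

This sketch only handles identical URIs on both sides. When the two models use different but related terms, an ontology and a reasoner would be needed to bridge them, which is the role shown in the figure.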
32. Using Python code emitted by LinkML
# Types
class String(str):
    type_class_uri = XSD.string
    type_class_curie = "xsd:string"
    type_name = "string"
    type_model_uri = EX.String

@dataclass
class Person(YAMLRoot):
    """
    Minimal information about a person
    """
    id: Union[str, PersonId] = None
    first_name: Union[str, List[str]] = None
    last_name: str = None
    knows: Optional[Union[Union[str, PersonId], List[Union[str, PersonId]]]] = empty_list()

    def __post_init__(self, *_: List[str], **kwargs: Dict[str, Any]):
        if self.id is None:
            raise ValueError("id must be supplied")
        if not isinstance(self.id, PersonId):
            self.id = PersonId(self.id)
        if self.first_name is None:
            raise ValueError("first_name must be supplied")
        elif not isinstance(self.first_name, list):
            self.first_name = [self.first_name]
        elif len(self.first_name) == 0:
            raise ValueError("first_name must be a non-empty list")
        self.first_name = [v if isinstance(v, str) else str(v) for v in self.first_name]
        ...

from examples.basic import Person
sam = Person("1172438", first_name=["Samual", "J"], last_name="Snooter")
print(sam)
# Person(id='1172438', first_name=['Samual', 'J'], last_name='Snooter', knows=[])
fred = Person("a117", first_name="John")
# ...
# ValueError: last_name must be supplied
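The generated class depends on runtime pieces that the slide does not show (`YAMLRoot`, `PersonId`, `empty_list`). The following is a standalone sketch of the same normalize-and-validate pattern using only the standard library. It is an assumption-laden simplification and not the code LinkML emits: `PersonId` is replaced by a plain `str`, and the `last_name` check is written out to match the error shown on the slide.

```python
# Standalone sketch of the pattern only; NOT the code LinkML generates.
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Person:
    id: Optional[str] = None
    first_name: Union[str, List[str], None] = None
    last_name: Optional[str] = None
    knows: List[str] = field(default_factory=list)

    def __post_init__(self):
        if self.id is None:
            raise ValueError("id must be supplied")
        if self.first_name is None:
            raise ValueError("first_name must be supplied")
        # Accept a bare string and coerce it to a one-element list.
        if not isinstance(self.first_name, list):
            self.first_name = [self.first_name]
        if len(self.first_name) == 0:
            raise ValueError("first_name must be a non-empty list")
        self.first_name = [str(v) for v in self.first_name]
        if self.last_name is None:
            raise ValueError("last_name must be supplied")


sam = Person("1172438", first_name="Samual", last_name="Snooter")
print(sam.first_name)

try:
    Person("a117", first_name="John")
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

Doing the coercion in `__post_init__` means callers can pass either a single value or a list for a multivalued slot, while everything downstream sees one consistent shape.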
33. The LinkML runtime can consume and create...
● JSON instances
● YAML instances
● RDF instances
● Tabular instances (CSV, TSV, Spreadsheet)
● FHIR instances
● …
Figure labels: LinkML Runtime; Schema.py
34. Generated Python can be a gateway to anything...
● JSON instances
● YAML instances
● RDF instances
● Tabular instances (CSV, TSV, Spreadsheet)
● FHIR instances
● …
Figure labels: LinkML Runtime; Schema.py; Any Jupyter / Big Data / Pandas tool that supports
35. Objects can be exported as JSON, YAML, or RDF
Python:
from examples.basic import Person
from linkml.dumpers import json_dumper, rdf_dumper, yaml_dumper
sam = Person("1172438", first_name=["Samual", "J"], last_name="Snooter")
ann = Person("17a3923", first_name="Jill", last_name="Jones", knows=[sam.id])
print(json_dumper.dumps(ann))
print(yaml_dumper.dumps(ann))
print(rdf_dumper.dumps(ann, contexts="../examples/jsonld/basic.context.jsonld"))
JSON output:
{
  "id": "17a3923",
  "first_name": [
    "Jill"
  ],
  "last_name": "Jones",
  "knows": [
    "1172438"
  ],
  "@type": "Person"
}
YAML output:
id: 17a3923
first_name:
- Jill
last_name: Jones
knows:
- '1172438'
RDF output (by way of JSON-LD):
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix sdo: <https://schema.org/> .
<https://example.org/linkml/hello-world/17a3923> a sdo:Person ;
    foaf:knows <https://example.org/linkml/hello-world/1172438> ;
    sdo:familyName "Jones" ;
    sdo:givenName "Jill" .
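To make "RDF by way of JSON-LD" concrete, here is a toy expansion of the JSON output into triples. A real pipeline would use a JSON-LD processor. The `CONTEXT` table is an assumed, simplified stand-in for `basic.context.jsonld`, whose actual contents are not shown on the slide, and `to_triples` is a hypothetical helper.

```python
# Toy expansion only; a real system would use a JSON-LD processor.
# CONTEXT is an assumed stand-in for the real basic.context.jsonld.
BASE = "https://example.org/linkml/hello-world/"
SDO = "https://schema.org/"
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# JSON key -> (predicate URI, whether values refer to other resources)
CONTEXT = {
    "first_name": (SDO + "givenName", False),
    "last_name": (SDO + "familyName", False),
    "knows": (FOAF + "knows", True),
}


def to_triples(obj: dict) -> list:
    """Turn a plain JSON object into (subject, predicate, object) triples."""
    subject = BASE + obj["id"]
    triples = []
    if "@type" in obj:
        triples.append((subject, RDF_TYPE, SDO + obj["@type"]))
    for key, (predicate, is_ref) in CONTEXT.items():
        values = obj.get(key, [])
        if not isinstance(values, list):
            values = [values]
        for value in values:
            # References become URIs under BASE; literals stay as strings.
            triples.append((subject, predicate, BASE + value if is_ref else value))
    return triples


ann = {
    "id": "17a3923",
    "first_name": ["Jill"],
    "last_name": "Jones",
    "knows": ["1172438"],
    "@type": "Person",
}
triples = to_triples(ann)
for triple in triples:
    print(triple)
```

The JSON itself carries no URIs. All of the semantics live in the context, which is what lets developers keep working with ordinary JSON.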
36. Objects can be imported from JSON, YAML, or RDF
Python code:
from examples.basic import Person
from linkml.loaders import json_loader, rdf_loader, yaml_loader
fred = yaml_loader.load('input/fred.yaml', target_class=Person)
print(fred.first_name)  # ['Fred', 'William']
harvey = json_loader.load('https://raw.githubusercontent.com/hsolbrig/linkml-enhanced-template/master/tests/input/harvey.json', target_class=Person)
print(harvey.last_name)  # Mackerson
ann = rdf_loader.load('input/ann.xml', target_class=Person, fmt="xml")
print(ann.last_name)  # Richardson
input/fred.yaml:
id: 118-28-3199
first_name:
- Fred
- William
last_name: Phillips
knows:
- '1172438'
- '1172438'
http://example.org/.../harvey.json:
{
  "id": "118-78-0697",
  "first_name": [
    "Harvey"
  ],
  "last_name": "Mackerson"
}
input/ann.xml:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:sdo="https://schema.org/"
>
  <rdf:Description rdf:about="https://peoples.r.us">
    <sdo:givenName>Ann</sdo:givenName>
    <rdf:type rdf:resource="https://schema.org/Person"/>
    <sdo:familyName>Richardson</sdo:familyName>
    <sdo:givenName>Elizabeth</sdo:givenName>
  </rdf:Description>
</rdf:RDF>
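As a rough illustration of what loading the RDF/XML input involves, the sketch below reads the same document with the standard library only. It is not how `rdf_loader` works internally, and it handles only this flat `rdf:Description` pattern; a general RDF parser would be needed for anything else. The XML declaration is dropped from the embedded string for simplicity.

```python
# Standard-library sketch for this one flat RDF/XML pattern only.
# It does not reproduce rdf_loader and is not a general RDF parser.
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SDO_NS = "https://schema.org/"

ANN_XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:sdo="https://schema.org/">
  <rdf:Description rdf:about="https://peoples.r.us">
    <sdo:givenName>Ann</sdo:givenName>
    <rdf:type rdf:resource="https://schema.org/Person"/>
    <sdo:familyName>Richardson</sdo:familyName>
    <sdo:givenName>Elizabeth</sdo:givenName>
  </rdf:Description>
</rdf:RDF>"""

root = ET.fromstring(ANN_XML)
description = root.find(f"{{{RDF_NS}}}Description")

# ElementTree addresses namespaced names as "{namespace}local".
subject = description.get(f"{{{RDF_NS}}}about")
given_names = [e.text for e in description.findall(f"{{{SDO_NS}}}givenName")]
family_name = description.findtext(f"{{{SDO_NS}}}familyName")

print(subject, given_names, family_name)
```

Repeated `sdo:givenName` elements naturally become a list, which matches the multivalued `first_name` slot in the Python class.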
38. LinkML incorporates ISO/IEC 11179-3 meaning/data model
slots:
  ...
  gender:
    description: Person gender
    slot_uri: SDO:gender
    range: gender_enum
classes:
  Thing:
    description: The most generic type of item.
    class_uri: SDO:Thing
    slots:
      - identifier
      - url
      - name
  Person:
    is_a: Thing
    class_uri: SDO:Person
    description: A person (alive, dead, undead, or fictional).
    slots:
      - givenName
      - additionalName
      - gender
ISO/IEC 11179-3:2013(E)
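In the schema above, `Person` declares `is_a: Thing`, so it picks up the slots of `Thing` in addition to its own. A minimal sketch of that resolution follows, using a plain dictionary that mirrors the two classes. The `all_slots` helper is hypothetical and is not a LinkML function.

```python
# Hypothetical helper illustrating is_a slot inheritance; not a LinkML API.
classes = {
    "Thing": {"slots": ["identifier", "url", "name"]},
    "Person": {"is_a": "Thing", "slots": ["givenName", "additionalName", "gender"]},
}


def all_slots(class_name: str) -> list:
    """Collect slots from the root ancestor down to the named class."""
    cls = classes[class_name]
    inherited = all_slots(cls["is_a"]) if "is_a" in cls else []
    # Append the class's own slots, skipping any already inherited.
    return inherited + [s for s in cls["slots"] if s not in inherited]


person_slots = all_slots("Person")
print(person_slots)
```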
39. ISO/IEC 11179-3 divides enums into representation / meaning
ISO/IEC 11179-3:2013(E) p. 101
● Representation: a value that can appear in the data
● Meaning: what a particular value means
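The representation / meaning split can be sketched with a Python enum whose member values are the strings that appear in the data, paired with a separate table of meaning URIs. The permissible values `male` and `female` are assumptions, since the slide does not list the contents of `gender_enum`. The URIs are the schema.org `Male` and `Female` terms.

```python
# Sketch of the representation / meaning split. The permissible values are
# assumed for illustration; the slide does not list gender_enum's contents.
from enum import Enum


class GenderEnum(Enum):
    # Member value = representation: the string that appears in the data.
    MALE = "male"
    FEMALE = "female"


# Meaning: the URI that says what each permissible value denotes.
MEANING = {
    GenderEnum.MALE: "https://schema.org/Male",
    GenderEnum.FEMALE: "https://schema.org/Female",
}


def meaning_of(raw: str) -> str:
    """Validate a raw data value and return the URI of its meaning."""
    return MEANING[GenderEnum(raw)]  # raises ValueError for unknown values


print(meaning_of("female"))
```

Keeping the two apart means the stored string can change (for example `F` instead of `female`) without changing what the value means.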
44. LinkML and the Yosemite vision
LinkML:
- Embed RDF semantics directly in the Source and target models
- Augment the translation process with ontology and reasoners.
Figure labels: LinkML source model A; LinkML source model B; Ontology / Reasoners; Semantic representation of Model Content
46. The LinkML model is developed in LinkML
https://w3id.org/linkml/meta.yaml
https://w3id.org/linkml/SchemaDefinition
https://w3id.org/linkml/meta.context.jsonld
47. The Balancing Act
Information Modeling World
● Explicit structures
● Implicit semantics
● Closed World Assumption
● Classes are Primary
○ Attributes owned by classes
Ontology Modeling World
● Structures are dynamic
● Semantics front and center
● Open World Assumption
● Slots (Predicates) and Classes (Resources) are co-equals
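The difference between the two assumptions can be shown on a single record. This is a deliberately small sketch: the field sets and both checker functions are hypothetical, and the open-world check is reduced to one example of an explicit contradiction.

```python
# Hypothetical checkers contrasting the two assumptions; not LinkML code.
REQUIRED = {"id", "last_name"}
ALLOWED = REQUIRED | {"first_name", "knows"}


def closed_world_errors(record: dict) -> list:
    """Closed world: anything not stated, or not declared, is an error."""
    present = set(record)
    errors = [f"missing required field: {k}" for k in sorted(REQUIRED - present)]
    errors += [f"unknown field: {k}" for k in sorted(present - ALLOWED)]
    return errors


def open_world_errors(record: dict) -> list:
    """Open world: absent or extra statements are unknown, not wrong.

    Only an explicit contradiction is reported (here, a non-string id).
    """
    if "id" in record and not isinstance(record["id"], str):
        return ["id is stated but is not a string"]
    return []


record = {"id": "a117", "nickname": "Johnny"}
closed = closed_world_errors(record)
opened = open_world_errors(record)
print(closed)
print(opened)
```

The same record fails the closed-world check twice and passes the open-world check, which is why a framework that serves both worlds has to be explicit about which one a given tool applies.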
49. Biolink Model
Biolink: Goals
The charge from NCATS:
● Create a Knowledge Graph Schema
● Encompass all biology from molecules through to clinical entities
● Get 20 different sites using the same data model
○ (oh: Only a handful of which use RDF/OWL)
● Do it quickly and break new ground in Translational Science
50. Biolink Model
Approach
● Build data model:
○ Main categories (gene, chemical, disease, …)
○ Predicates and associations
■ E.g. chemical treats disease, gene interacts with gene
○ Leverage ontologies
● Collaborative development
○ Domain-specific working groups
○ Anyone can make Pull Requests
Why LinkML?
● Validate using closed-world assumption
● Ontologies and semantics, but in the background
● Property graphs and edges as first-class citizens
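Treating edges as first-class citizens means an association such as "chemical treats disease" is its own object with its own properties, rather than a bare link between two nodes. A minimal sketch follows. The class names, field names, and `EX:` identifiers are hypothetical placeholders and are not Biolink Model definitions.

```python
# Hypothetical sketch: class names, fields, and EX: identifiers are
# placeholders, not Biolink Model definitions.
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Node:
    id: str
    category: str


@dataclass
class Association:
    """An edge that is an object in its own right and can carry properties."""
    subject: Node
    predicate: str
    object: Node
    evidence: List[str] = field(default_factory=list)


chemical = Node("EX:chemical1", "chemical")
disease = Node("EX:disease1", "disease")
edge = Association(chemical, "treats", disease, evidence=["EX:publication1"])

print(f"{edge.subject.id} {edge.predicate} {edge.object.id} (evidence: {edge.evidence})")
```

Because the edge is a record like any other, it can be validated against the schema under the closed-world assumption in the same way nodes are.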
51. Biolink Model
Where we are (year 2 of 5)
● All “Knowledge Providers” and “Autonomous Relay Agents” nominally using Biolink
● Validation dashboard in progress
● Early demonstrations of powerful federated queries
52. National Microbiome Data Collaborative
Goal
● Make multi-omics microbiome data FAIR
○ Environments
○ Metagenomes
○ Metatranscriptomes
○ Metabolomics
○ Metaproteomics
● Leverage existing ontologies and standards
● Enable discovery in microbiome science
53. National Microbiome Data Collaborative
Approach
● Formalize existing “checklist” standards
● Create modular schema
● Leverage MIxS, ENVO, PROV
Why LinkML?
● Developers like JSON + JSON-Schema
● Biologists like spreadsheets
● “Semantic enums” work well
● Needed something that worked with traditional technology (Mongo, Postgres)
● “Stealth semantics”
○ Everything has URI
○ All JSON is transparently JSON-LD
54. National Microbiome Data Collaborative
Where we are (year 2)
● Unified modular schema
● Heterogeneous data successfully integrated
○ Environmental
○ Multiple omics types
○ Functional annotation
○ MAG binning
● Ontologies like ENVO used as ‘slot-fillers’
● Easy for developers
○ System based mainly on JSON exchange
○ RDF can be leveraged
○ Currently Mongo + Postgres
○ Working on TerminusDB adapters
● Working with upstream standards providers to LinkML-ify checklists
○ Spreadsheets → computable artefacts
55. Other projects
● Center for Cancer Data Harmonization
○ Cancer sample and patient metadata
○ Omics data
● HOT Ecosystem
○ Health Open Terminologies
○ SKOS metamodel
● Genome Features
○ Formalization of GFF3 schema
○ Sequence Ontology
● Unified Chemistry Datamodel
○ Data model and ontology for chemistry
● Gene Ontology
○ Causal Activity Models
● CSOLink
○ A high-level data model of computer systems
56. Help wanted: LinkML is still very much under construction
Inquire at monarchinit@gmail.com or with the authors
58. Contributors
● Chris Mungall (Berkeley Lab)
● Deepak Unni (Berkeley Lab)
● Dazhi Jiao (Johns Hopkins University)
● Harold Solbrig (Johns Hopkins University)
● Richard Bruskiewich (Star Informatics)
● Jim Balhoff (RENCI)
● William Duncan (Berkeley Lab)
● Harshad Hegde (Berkeley Lab)
● Mark Miller (Berkeley Lab)
● Melissa Haendel (CU)
● Matthew Brush (OHSU)
● Sierra Moxon (Berkeley Lab)
● Donnie Winston (Polyneme)
59. Funding
LinkML project development was supported by funding from:
● NCATS Translator (OT2 TR003449)
● NIH Monarch (R24 OD011883)
● CD2H (U24 TR002306)
● CCDH
● FHIRCat (R56 EB028101)
● Phenomics First (RM1 HG010860)
● DOE National Microbiome Data Collaborative
60. Links and contact information
https://linkml.github.io/
https://github.com/linkml/
https://github.com/linkml/examples/ (Will be available shortly…)
solbrig@jhu.edu - Harold Solbrig