This document discusses using Wikidata as a platform for biocuration. Wikidata is presented as a new paradigm that could reduce pain points in current biocuration practices by providing a single platform with persistent data access. It describes Wikidata's structure as a knowledge base of unique items and statements linked together to form a knowledge graph. Examples show how biomedical data like genes and proteins are represented. The document outlines Wikidata's community processes and increasing impact on applications like Wikipedia and genome browsers. It envisions the potential for researchers to contribute new biomedical knowledge through Wikidata.
Wikidata workshop for ISB Biocuration 2016
1. Wikidata as a platform for biocuration
Benjamin Good
The Scripps Research Institute
@bgood
2. Organization, more information
http://tinyurl.com/biocuration-wikidata
• Part 1: getting to know Wikidata
• A platform for biocuration (Ben Good)
• Wikipedia and Wikidata for research (Daniel Mietchen)
• Editing Wikidata (Sebastian Burgstaller-Muelbacher)
• Coffee Break
• Part 2: getting your hands on Wikidata: flash biocuration jamboree!
3. The dominant paradigm for open biocuration
[Diagram. Labels: API, Flatfiles, Your Database (repeated), xrefs, My Web Application, My Database, My Database Curators, My Research Grants ($), Biomedical knowledge]
Pain points
• API or flatfile parsing
• Ambiguous or non-existent xrefs
• Persistence of funding
• Too much information to curate
4. A new paradigm for open biocuration?
[Diagram. Labels: Our Applications (repeated), Our Database?, Our Database Curators and our community, Biomedical knowledge, My Research Grants ($)]
Reducing the pain
• Reduces API/parser proliferation
• Forces up-front integration
• Facilitates coordination
• Ensures that if funding is lost, data is not
• Invites community input
5. A new platform for open biocuration?
[Diagram. Labels: Our Applications (repeated), Our Database Curators and our community, Biomedical knowledge, My Research Grants ($)]
6. Wikidata is to data as Wikipedia is to text
“Giving more people more access to more knowledge”
A free and open repository of knowledge
• Initiated by Wikimedia Germany
• In transition to the Wikimedia Foundation
• Not a ‘project’… as stable as Wikipedia
8. Elements of the knowledge base are called ‘items’
• Labels and descriptions in many languages
9. Items are unique concepts, used to link different language Wikipedias together
Q146
Af:Kat
En:cat
Als:Hauskatze
Ang:Catte
Av:Keto
10. Items are described by “statements” that link together to form the language-independent Wikidata knowledge graph
[Diagram: Cat → Subclass Of → Domesticated Animal → Subclass Of → Animal; Animal has Taxon name: Animalia and Taxon rank: Kingdom]
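To make the "statements form a graph" idea concrete, here is a minimal sketch that stores the cat example as plain (subject, property, value) triples and walks the "subclass of" links. The triple list and the `superclasses` helper are illustrative assumptions, not how Wikidata stores data internally.

```python
# Each statement is reduced to a (subject, property, value) triple.
# Labels are used instead of Q/P identifiers purely for readability.
TRIPLES = [
    ("cat", "subclass of", "domesticated animal"),
    ("domesticated animal", "subclass of", "animal"),
    ("animal", "taxon name", "Animalia"),
    ("animal", "taxon rank", "kingdom"),
]


def superclasses(node, triples):
    """Follow 'subclass of' links upward from node and return the chain."""
    chain = []
    seen = {node}
    current = node
    while True:
        parents = [o for s, p, o in triples if s == current and p == "subclass of"]
        # Stop at the top of the hierarchy or if a cycle is detected.
        if not parents or parents[0] in seen:
            return chain
        current = parents[0]
        seen.add(current)
        chain.append(current)


def values(node, prop, triples):
    """Return every value attached to node by the given property."""
    return [o for s, p, o in triples if s == node and p == prop]
```

For example, `superclasses("cat", TRIPLES)` walks from cat up to animal, and `values("animal", "taxon rank", TRIPLES)` reads a property off the node it reaches.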
12. Genomic position for Reelin gene
Item: Q414043 (RELN)
https://www.wikidata.org/wiki/Q414043
Statement
• Claim: Property: Genomic start; Value (numeric): 103471784
• Qualifiers: GenLoc assembly: GRCh38
• References: Stated in: Ensembl Release 83; Retrieved: 19 January 2016
13. Linking the Reelin gene to a protein it encodes
Item: Q414043 (RELN)
https://www.wikidata.org/wiki/Q414043
Statement
• Claim: Property: Encodes; Value (item): Reelin (protein)
• References: Stated in: NCBI homo sapiens annotation release 107; Retrieved: 19 January 2016
14. Gene ontology annotation for Reelin protein with evidence codes modeled as qualifiers
Item: Q13561329 (Reelin)
https://www.wikidata.org/wiki/Q13561329
Statement
• Claim: Property: Cell component; Value (item): dendrite
• Qualifiers: Determination method: ISS (Sequence or structural Similarity), IEA (Electronic annotation)
• References: Stated in: UniProt; Retrieved: 21 March 2016
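The three example statements share one shape: a claim (property plus value), optional qualifiers, and references. A minimal sketch of that shape as a Python dataclass follows; the class and field names are assumptions made for illustration and are not Wikidata's own data model API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Statement:
    """One statement: a claim plus its qualifiers and references."""
    item: str                     # subject item, e.g. "Q414043"
    prop: str                     # property label (labels used for readability)
    value: Union[str, int]        # a number, a string, or another item id
    qualifiers: Dict[str, List[str]] = field(default_factory=dict)
    references: Dict[str, str] = field(default_factory=dict)


# The three statements shown on the slides.
genomic_start = Statement(
    item="Q414043", prop="genomic start", value=103471784,
    qualifiers={"GenLoc assembly": ["GRCh38"]},
    references={"stated in": "Ensembl Release 83", "retrieved": "19 January 2016"},
)
encodes = Statement(
    item="Q414043", prop="encodes", value="Q13561329",
    references={"stated in": "NCBI homo sapiens annotation release 107",
                "retrieved": "19 January 2016"},
)
cell_component = Statement(
    item="Q13561329", prop="cell component", value="dendrite",
    qualifiers={"determination method": ["ISS", "IEA"]},
    references={"stated in": "UniProt", "retrieved": "21 March 2016"},
)


def describe(statement: Statement) -> str:
    """Render the bare claim, without qualifiers or references."""
    return f"{statement.item} {statement.prop} -> {statement.value}"
```

Keeping qualifiers and references separate from the claim mirrors the slides: the evidence codes qualify the annotation, while "stated in" and "retrieved" record where it came from.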
23. A new platform for open biocuration?
[Diagram. Labels: Our Applications (repeated), Our Database Curators and our community, Biomedical knowledge, My Research Grants ($)]
• SPARQL = a common API for accessing content
• 1 endpoint to maintain…
• It's working
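As a rough sketch of what "SPARQL as a common API" looks like from an application, the snippet below builds a request against the public query endpoint using only the Python standard library. It assumes that P688 is the "encodes" property and reuses the RELN item (Q414043) from the earlier slides; the User-Agent string is a placeholder.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Assumption: P688 is the "encodes" property; Q414043 is the RELN gene item.
QUERY = """
SELECT ?protein ?proteinLabel WHERE {
  wd:Q414043 wdt:P688 ?protein .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def build_url(query: str) -> str:
    """Encode a SPARQL query as a GET URL that asks for JSON results."""
    return ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})


def run_query(query: str) -> dict:
    """Send the query over the network and decode the JSON response."""
    request = urllib.request.Request(
        build_url(query),
        headers={"User-Agent": "biocuration-example/0.1 (placeholder contact)"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def column(result: dict, variable: str) -> list:
    """Pull one variable's values out of a SPARQL JSON result set."""
    return [row[variable]["value"] for row in result["results"]["bindings"]]
```

With network access, `column(run_query(QUERY), "proteinLabel")` would list the labels of the proteins the gene is stated to encode. Every application talks to the same endpoint, so there is no per-database parser to write.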
24. Wikidata and the Semantic Web
• Hub for linked open data
• A lot of initial contents are identifier links (e.g. we link drug items to 18 different schemes)
• e.g. see Vemurafenib
• Supports federated queries
– e.g. you can do one query that spans Wikidata content and UniProt RDF content
On its way to replacing DBpedia as the central node
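A federated query of the kind mentioned above might be composed as follows. This is a sketch under several assumptions: that P352 is Wikidata's "UniProt protein ID" property, that UniProt's SPARQL endpoint lives at the URL shown, that entries are identified by `http://purl.uniprot.org/uniprot/<accession>` with a `up:mnemonic` predicate, and that the Wikidata query service permits federation to that endpoint.

```python
from string import Template

# Assumptions (see lead-in): P352 = UniProt protein ID; the UniProt endpoint,
# entry IRI pattern, and up:mnemonic predicate are as written here.
FEDERATED = Template("""
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein ?uniprotId ?mnemonic WHERE {
  { SELECT ?protein ?uniprotId WHERE { ?protein wdt:P352 ?uniprotId . } LIMIT $limit }
  BIND(IRI(CONCAT("http://purl.uniprot.org/uniprot/", ?uniprotId)) AS ?entry)
  SERVICE <$remote> { ?entry up:mnemonic ?mnemonic . }
}
""")


def federated_query(limit: int = 5,
                    remote: str = "https://sparql.uniprot.org/sparql") -> str:
    """Fill in the row limit and the remote endpoint for the SERVICE clause."""
    return FEDERATED.substitute(limit=limit, remote=remote)
```

The identifier link stored in Wikidata is what joins the two datasets: the inner query picks a few proteins with a UniProt ID, and the `SERVICE` block asks the remote endpoint about the matching entries.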
26. Social controls
• Anyone can
• Add or edit labels, descriptions, statements, references etc. on existing items
• Create new items
• Link items to Wikipedia articles
• Query using https://query.wikidata.org
• Read and write small numbers of edits with https://www.wikidata.org/w/api.php
• Propose a new property
• Request a bot account for high-volume automated editing
Here be dragons..
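For the read side of the api.php route listed above, a minimal sketch using the `wbgetentities` action is shown below. Only reading is covered; writing requires authentication and edit tokens, which are out of scope here. The helper names are illustrative.

```python
import urllib.parse

API = "https://www.wikidata.org/w/api.php"


def entity_url(qid: str, language: str = "en") -> str:
    """Build a wbgetentities request for one item's labels and descriptions."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels|descriptions",
        "languages": language,
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)


def label_of(response: dict, qid: str, language: str = "en") -> str:
    """Extract one label from a decoded wbgetentities JSON response."""
    return response["entities"][qid]["labels"][language]["value"]
```

Fetching `entity_url("Q414043")` with any HTTP client and passing the decoded JSON to `label_of` would return the item's English label.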
27. Properties (as of April 10, 2016)
• 2196 active properties
• 114 new properties that have been proposed but not yet approved
Proposal
https://www.wikidata.org/wiki/Wikidata:Property_proposal
28. After proposal, community discussion
• Each property is left open for discussion by anyone until an administrator or other person blessed with the power either creates it or decides not to create it based on the discussion
• People that enjoy ontology arguments needed here!
[Screenshot: lengthy (cut-off) discussion of proposal for ‘extinct’ property]
31. Proposal discussions
• Cannot be avoided
• The discussions are long and tiring but important
• Many of the people involved are quite experienced
• All are trying to make something great
• Persistence and patience required
33. The first application built on Wikidata, Wikipedia
[Diagram. Labels: Our Database Curators and our community, Biomedical knowledge, Our Applications (repeated), Su, Schriml, Pavlidis R01 Grant… ($)]
36. Impact of Wikidata on Wikipedia
Gene Wiki
Version 1.
{{GNF_Protein_box | Name = Reelin| image = |
image_source = | PDB = {{PDB2|4AD9}} | HGNCid = 18512 |
MGIid = | Symbol = LACTB2 | AltSymbols =; CGI-83 |
IUPHAR = | ChEMBL = | OMIM = None | ECnumber = |
Homologene = 9349 | GeneAtlas_image1 = |
GeneAtlas_image2 = | GeneAtlas_image3 = |
Protein_domain_image = | Function =
{{GNF_GO|id=GO:0005515 |text = protein binding}}
{{GNF_GO|id=GO:0016787 |text = hydrolase activity}}
{{GNF_GO|id=GO:0046872 |text = metal ion binding}} |
Component = {{GNF_GO|id=GO:0005739 |text =
mitochondrion}} | Process = {{GNF_GO|id=GO:0008152
|text = metabolic process}} | Hs_EntrezGene = 51110 |
Hs_Ensembl = ENSG00000147592 | Hs_RefseqmRNA =
NM_016027 | Hs_RefseqProtein = NP_057111 |
Hs_GenLoc_db = hg38 | Hs_GenLoc_chr = 8 |
Hs_GenLoc_start = 70635318 | Hs_GenLoc_end = 70669174
| Hs_Uniprot = Q53H82 | Mm_EntrezGene = 212442 |
Mm_Ensembl = ENSMUSG00000025937 |
Mm_RefseqmRNA = NM_145381 | Mm_RefseqProtein =
NP_663356 | Mm_GenLoc_db = mm10 | Mm_GenLoc_chr =
1 | Mm_GenLoc_start = 13623330 | Mm_GenLoc_end =
13660546 | Mm_Uniprot = Q99KR3 | path = PBB/51110}}
(1 of these for every gene)
Gene Wiki
Version 2.
{{Infobox gene}}
• All data in Wikidata
• 1 Lua script works for all genes
37. Wikidata use increasing on Wikipedia
• https://en.wikipedia.org/wiki/Category:Templates_using_data_from_Wikidata
• 81 templates indicate that they use it
40. The next application built on Wikidata, yours?
[Diagram. Labels: Our Database Curators and our community, Biomedical knowledge, ???? (repeated), My Research Grants ($)]
It's your data as much as anyone else’s!
41. Current state: seeding nodes for the graph
• All human and mouse genes and proteins (Swiss-Prot)
• All Gene Ontology terms
• All FDA approved drugs
• All Human Disease Ontology terms
• 109 reference microbial genomes
Burgstaller-Muelbacher et al (2016) Database
Mitraka et al (2015) Semantic Web Applications for the Life Sciences
Putman et al (2016) Database
42. Next data step: connecting the nodes
• Our group
• Human: Gene-disease (PhenoCARTA team)
• Human: Drug-disease (ChEMBL, NDF-RT)
• Human: Gene-drug (ChEMBL, NDF-RT)
• Expanding microbial information (Putman, Koehurst, Knight lab)
• Your group
• ?
• Today
• A knowledge base for understanding Zika?
43. Acknowledgements
Gene Wikidata Team
Andra Waagmeester (Micelio)
* Sebastian Burgstaller (Scripps)
* Tim Putman (Scripps)
* Elvira Mitraka (U Maryland)
Julia Turner (Scripps)
Justin Leong (UBC)
Lynn Schriml (U Maryland)
Paul Pavlidis (UBC)
Andrew Su (Scripps)
Ginger Tsueng (Scripps)
Contact
bgood@scripps.edu
@bgood on twitter
* First author on manuscript cited in this presentation
Adapted logo
Su Laboratory at TSRI
The 16,950 other active editors of Wikidata and especially the 693 that joined last month and the 809 that joined the month before that and the 721 that joined the month before that..
Editor's notes
Wikidata: Advancing science through semantic integration of genes, diseases, and drugs
This is the central point I want to make. Wikidata can be used to build knowledge-based applications, lowering the barrier to entry for building apps and reducing challenges of downstream data integration.
Before coming back to this, I will explain why.
By mixing the data into wikidata, we reduce API proliferation, easing application formation.
Over 1 billion triples
Fast
Stable since around September 2015
Successor to DBpedia
The simplest and first use case is identifier mapping
Supports linking of distributed linked data sets
This is the first application of the work that we have done