3. Producing Life Sciences Linked Data
(Problems)
Most Linked Open Data is created and provided
without the help of the original data provider who
Almost all Linked Open Data in Life Sciences is provided by Bio2RDF
4. Producing Life Sciences Linked Data
(Problems)
• A database is a biologist's life's work, and he or she
wants to publish it
– but not to lose the control
• An RDF dump of the DB is cheap
– but supporting Queries and Data Analysis is expensive
– where is the money coming from?
• They are very motivated to add value to the data
– but they still lack up-to-date ICT skills
• Help is wanted to kill Bio2RDF
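To illustrate why an RDF dump of a database is the cheap part, here is a minimal sketch that turns table rows into N-Triples lines. The base IRI, the table name, and the rows are hypothetical, and the sketch assumes that key values are already safe to embed in an IRI.

```python
# Hypothetical base IRI and sample rows; not taken from any real database.
BASE = "http://example.org/db/"

def escape_literal(value: str) -> str:
    """Escape a string for use inside an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))

def row_to_ntriples(table: str, key: str, row: dict) -> list:
    """Emit one triple per non-null, non-key column of the row."""
    # Assumption: row[key] contains only IRI-safe characters.
    subject = f"<{BASE}{table}/{row[key]}>"
    lines = []
    for column, value in row.items():
        if column == key or value is None:
            continue
        predicate = f"<{BASE}{table}#{column}>"
        lines.append(f'{subject} {predicate} "{escape_literal(str(value))}" .')
    return lines

rows = [
    {"id": "P1", "name": "example protein", "organism": "E. coli"},
    {"id": "P2", "name": 'protein "two"', "organism": None},
]
dump = [line for row in rows for line in row_to_ntriples("protein", "id", row)]
print("\n".join(dump))
```

A dump like this carries no query or analysis capability by itself, which is where the cost mentioned above appears.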
5. Consuming Linked Data
• Number of Linked Data repositories will keep growing
• Using Linked Data in Life Sciences means linking data
with existing tools that are de facto standards in certain
subdomains:
• Pathways
http://sbmm.uma.es
• Proteins
6. Consuming Linked Data
• Data Analysis Services (not only queries but also Data
Mining, Crawling, and Reasoning) are needed to engage
the community
– BioMedical uses (Pharmaceuticals testing, drug screening)
7. Consuming Linked Data
• Reasoning, removed to make data reuse possible,
should be re-introduced in some cases over real
complex ontologies with large sets of data
– BioPax Level 3 (Level 4 under development)
• OWL Species: DL
• DL Expressivity: SHIF(D)
• Consistent: Yes
– BioPax Level 3 (4 officially identified databases; more DBs publish
data as BioPax Level 3 instances)
• Reactome Database
– 1.54 GB
– 2 980 230 triples
– BioPax Level 2 (9 officially identified databases)
• Beforehand, data and ontologies must be cleaned up
8. Consuming Linked Data
• Reasoning Services over real complex ontologies with
large sets of data
– Cost reduction in experiment design
– Hypothesis demonstration/refutation
– Privacy in reasoning with public + private data
9. Consuming Linked Data
• Reasoning for classification problems
– Disease classification / diagnosis
– Protein identification
– Pathway alignment
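As a rough illustration of reasoning used for classification, the sketch below infers every class an individual belongs to by following subclass links upward. The class names and the individual are hypothetical, and a real OWL reasoner does far more than this subsumption walk (restrictions, property characteristics, consistency checking).

```python
# Hypothetical class hierarchy (child -> direct parents) and asserted types.
SUBCLASS_OF = {
    "ViralPneumonia": {"Pneumonia", "ViralInfection"},
    "Pneumonia": {"LungDisease"},
    "LungDisease": {"Disease"},
    "ViralInfection": {"Disease"},
}
ASSERTED_TYPES = {"case_1": {"ViralPneumonia"}}

def superclasses(cls, hierarchy):
    """Return all direct and indirect superclasses of cls."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in hierarchy.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def inferred_types(asserted, hierarchy):
    """Add every superclass of each asserted class to the individual's types."""
    result = {}
    for individual, classes in asserted.items():
        all_types = set(classes)
        for cls in classes:
            all_types |= superclasses(cls, hierarchy)
        result[individual] = all_types
    return result

types = inferred_types(ASSERTED_TYPES, SUBCLASS_OF)
print(sorted(types["case_1"]))
```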
12. Scalability Issues in Life Sciences
• Real scenarios with rich ontologies are starting to
appear:
– BioPax Level 3 4: complex OWL ontology (transitive, reflexive,
inverse and functional properties, restrictions in most of the
classes, 70 classes)
– Big data sets in OWL format (from 20MB to 45GB of data)
– Problems with the data:
• undetected ABox (and even TBox) inconsistencies because of
the lack of scalable reasoners
• Lack of SPARQL endpoints to query these data
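One kind of ABox inconsistency mentioned above can be sketched without a full reasoner: a functional data property that has two distinct literal values for the same subject. The property names and triples below are hypothetical, and values are compared as plain strings, which a real reasoner would not do (it would take datatypes and the rest of the ontology into account).

```python
from collections import defaultdict

# Hypothetical functional data properties and instance data.
FUNCTIONAL_PROPERTIES = {"hasSequence"}
TRIPLES = [
    ("protein1", "hasSequence", "MKT"),
    ("protein1", "hasSequence", "MKV"),   # conflicts with the line above
    ("protein1", "hasName", "a"),
    ("protein1", "hasName", "b"),         # fine: hasName is not functional
    ("protein2", "hasSequence", "GGA"),
]

def functional_violations(triples, functional):
    """Map (subject, property) to its values when a functional property has more than one."""
    values = defaultdict(set)
    for subject, prop, obj in triples:
        if prop in functional:
            values[(subject, prop)].add(obj)
    return {key: vals for key, vals in values.items() if len(vals) > 1}

violations = functional_violations(TRIPLES, FUNCTIONAL_PROPERTIES)
for (subject, prop), vals in violations.items():
    print(subject, prop, sorted(vals))
```

A single pass like this is linear in the number of triples; the scalability problem arises when all axioms of a rich ontology have to be checked together.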
13. Summary: Are we losing the war?
• Producing Linked Data in Life Sciences: Some risks and
some needs detected:
– A motivating reward scheme for the data owner
– Some specific infrastructure (action, facility, institute, foundation,
private…) support could be useful
• to engage data owners,
• to contribute technical capability and
• to share costs
14. Summary: Are we losing the war?
• Consuming Linked Data in Life Sciences Opportunities
– Connecting Linked Data with existing tools that are de facto
standards in certain LS subdomains
• to multiply impact
– Not only Query Services but also Data Analysis Services
(Crawling, Mining, Reasoning, etc.) should be provided to the
community
• but this is expensive for the average DB owner
– Data must be cleaned up, curated and cross-validated
• main threat
– Domain is lacking specific user interfaces
• this is related to the connection of LD to (de facto) standard tools
– In this domain it makes sense to reason
• but scalability is still an issue
15. Linked Data and Life Sciences
José F. Aldana Montes
jfam@lcc.uma.es