Producing, Publishing and Consuming Linked Data
Three lessons from the Bio2RDF project
François Belleau (francoisbelleau@yahoo.ca)

Background
With the proliferation of new online databases, data integration continues to be one of the major unsolved problems in bioinformatics. In spite of initiatives like BioPAX [1], BioMart [2], and the EBI, KEGG and NCBI integrated web resources, the web of bioinformatics databases is still a web of independent data silos.
Since 2005, the aim of the Bio2RDF project has been to make popular public datasets available in RDF, the data description format of the growing Semantic Web. Initially, data from OMIM, KEGG and Entrez Gene, along with numerous other resources, were converted to RDF. Currently, 38 SPARQL endpoints are made available from the Bio2RDF server [3].
[Figure: the workflow producing triples from a GenBank HTML web page about external database references.]

The Bio2RDF project was the primary source of bioinformatics data in the Linked Data cloud in 2009. Today, many organisations have started to publish their datasets or knowledge bases using the RDF and SPARQL standards. GO, UniProt and Reactome were early converts to publishing in RDF; most recently, PDBj, KEGG and NCBO have started to publish their own data in the new semantic way. From the data integration perspective, projects like BioLOD [4] from the RIKEN institute and Linked Life Data [5] from Ontotext have pushed the Semantic Web model close to production-quality service. The Linked Data cloud of bioinformatics is now growing rapidly [6]. The technology incubation phase is over.
One question data providers should ask themselves now is: how costly is it to produce and publish data in RDF according to this new paradigm? And, from the bioinformatician data consumer's point of view: how useful can Semantic Web technologies be for building the data mashups needed to support specific knowledge discovery tasks and the needs of domain experts?

[Figure: the instructions creating triples from the data flow.]

These are the questions we answer here, by proposing methods for producing, publishing and consuming RDF data, and by sharing the lessons we learned while building Bio2RDF.
Producing RDF
RDF is all about triples: building triples, storing triples and querying triples. A triple is defined by the subject-predicate-object model; if you have used a key-value table before, you already know what triples are. A collection of triples defines a graph so generic that all data can be represented with it. Every kind of data can be converted into triples from all common formats: HTML, XML, relational databases, tabular files or key-value representations. Converting data to RDF is so important to building the Semantic Web that it has given rise to new verbs: to triplify, or to rdfize! Building the Bio2RDF rdfizers, we had to deal with all those kinds of data formats and sources.

[Figure: data exposed as RDF using dereferenceable URIs, according to design rules 1 and 2.]
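To make the idea of rdfizing concrete, here is a minimal sketch that turns a key-value record into N-Triples lines. The gene record and property URIs are illustrative placeholders, not Bio2RDF's actual vocabulary, and real literals need fuller escaping than shown here.

```python
def rdfize(subject_uri, record):
    """Convert a dict of literal values into N-Triples lines.

    N-Triples is the simplest RDF syntax: one
    '<subject> <predicate> "object" .' statement per line.
    """
    lines = []
    for prop_uri, value in record.items():
        # Escape backslashes and quotes inside the literal.
        escaped = str(value).replace('\\', '\\\\').replace('"', '\\"')
        lines.append(f'<{subject_uri}> <{prop_uri}> "{escaped}" .')
    return lines

# Hypothetical record for a gene, keyed by property URI.
record = {
    "http://example.org/vocab/symbol": "TP53",
    "http://example.org/vocab/organism": "Homo sapiens",
}
for triple in rdfize("http://bio2rdf.org/geneid:7157", record):
    print(triple)
```

An ETL tool like Talend does exactly this kind of mapping, but graphically and at scale, with each data-flow step emitting triples instead of rows.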
Lesson #1: Transforming data into RDF is an ETL (Extract, Transform, Load) task, and there are now free, professional-grade frameworks available for this purpose.
Talend [7] is a first-class ETL framework, based on Eclipse, which generates native Java code from a graphical representation of the data transformation workflow. Using this professional-quality software to rdfize data is much more productive than writing the Java, Perl or PHP scripts we used to write in the past.

To build the namespace SPARQL endpoint at Bio2RDF [8], an RDF mashup composed of the GO, UniProt, LSRN, GenBank, MIRIAM and Bio2RDF namespace descriptions, we generated RDF from XML, HTML, key-value files, tabular files and an RDF dump. Using the Talend ETL framework made the programming job and quality testing far more efficient.
Publishing on the Linked Data web

The inventor of the web, Tim Berners-Lee, has also defined the rules by which the Semantic Web should be built [9]:
1) Use URIs as names for things
2) Use HTTP URIs so that people can look up those names.
3) When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
4) Include links to other URIs, so that they can discover more things.
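Rule 3 in practice means content negotiation: when dereferencing a URI, a client asks for RDF instead of HTML via the Accept header. Here is a minimal standard-library sketch; the media type, and whether a given server honours it, are assumptions.

```python
import urllib.request

def rdf_request(uri, media_type="application/rdf+xml"):
    """Prepare a GET request whose Accept header asks for RDF
    rather than the default HTML page."""
    return urllib.request.Request(uri, headers={"Accept": media_type})

# urllib.request.urlopen(rdf_request("http://bio2rdf.org/omim:602080"))
# would then return the machine-readable RDF description.
req = rdf_request("http://bio2rdf.org/omim:602080")
```

The same URI thus serves both audiences: a browser gets a human-readable page, a program gets triples.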
Building Bio2RDF, we have been early adopters of those rules. The DBpedia project, a version of Wikipedia available in RDF format and through one of the first major public SPARQL endpoints, is at the heart of the Linked Data cloud; it is built using the Virtuoso triplestore [10], a first-class piece of software that is free and open source.
Lesson #2: To publish Semantic Web data, choose a good triplestore, such as Virtuoso, and make a SPARQL endpoint publicly available on the Internet.
The Bio2RDF project has also depended on Virtuoso, and benefits from the innovations in each new version. Virtuoso not only offers a SPARQL endpoint for submitting queries based on the W3C standard; full-text search (with results ranked by the number of connections in the graph) and a facet-browsing user interface are also available, so the RDF graph can be browsed, queried, searched and explored with a type-ahead completion service, all from one software product, directly out of the box.
Sesame [11], 4store [12], Mulgara [13] and other new projects emerging each year make publishing data over the web an affordable reality.
Consuming triples
Why should we start using Semantic Web data and technologies? Because building a database from public resources on the web is more efficient than the traditional way of creating a data warehouse. The Giant Global Graph (GGG) of the entire Semantic Web is the new datastore from which you can build your semantic mashup, with the tools of your choice.
To answer a high-level scientific question from data already available in RDF, you first need to build a specific triplestore that you can then query and from which, hopefully, you will obtain the expected results. Building a specific database just to answer a specific question: that is what semantic data mashups are about.
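Building that question-specific triplestore typically starts with SPARQL CONSTRUCT queries, which pull a task-relevant subgraph out of each public endpoint so it can be loaded locally. A sketch, using a placeholder class URI rather than any real Bio2RDF schema:

```python
# CONSTRUCT copies matching triples out of the remote store, ready
# to be loaded into a local, question-specific triplestore.
def construct_subgraph_query(class_uri, limit=1000):
    """Build a CONSTRUCT query extracting every triple about
    instances of one class (illustrative; tune the pattern and
    LIMIT to the question at hand)."""
    return (
        "CONSTRUCT { ?x ?p ?o } "
        "WHERE { ?x a <%s> ; ?p ?o } "
        "LIMIT %d" % (class_uri, limit)
    )

q = construct_subgraph_query("http://example.org/vocab/Gene")
```

Run one such query per source endpoint, load the returned graphs into a single local store, and the mashup is ready to be queried as a whole.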
Lesson #3: Semantic data sources available from SPARQL endpoints can be consumed in all kinds of ways to create mashups: over HTTP GET, through SOAP services, or with new tools designed to explore RDF data.
For example, ways of consuming RDF include: (i) SPARQL queries over REST; (ii) RDF graphs dereferenced by URI over HTTP; (iii) SOAP services returning RDF (with the popular soapUI tool [16], you can consume Bio2RDF's SOAP services, which return triples in N-Triples format); or, better still, (iv) the new web services model proposed by the SADI framework [14]. Programming in Java, PHP, Ruby or Perl, using the RDF/XML, Turtle or RDF/JSON formats, is also possible, and the software needed gets better each year. It is a wild new world of open technologies that you will benefit from learning and using.
The Bio2RDF project first offered an RDF graph that could be dereferenced by a URI of the form http://bio2rdf.org/omim:602080. Any HTTP GET request will return the RDF version of a document from one of the databases we expose as RDF, in the format of your choice. Next, you can submit queries directly to one of our public SPARQL endpoints, like http://namespace.bio2rdf.org/sparql. By programming a script, or designing a workflow with software like Taverna or Talend, you can build your data mashup from the growing Semantic Web data sources in days, not weeks.
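Once the triples come back, for instance in N-Triples format as Bio2RDF's SOAP services return them, very little code is needed to consume the simple cases. The sketch below handles only plain one-line triples; a real parser (rdflib, for example) covers the full grammar with blank nodes, datatypes and language tags.

```python
import re

# Matches '<subject> <predicate> object .' on a single line, where
# the object may be a URI or a quoted literal. Deliberately
# simplified: no blank nodes, no multi-line statements.
TRIPLE_RE = re.compile(r'^(<[^>]*>)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$')

def parse_ntriples(text):
    """Parse simple N-Triples lines into (subject, predicate, object)
    string tuples, skipping blank lines and comments."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = TRIPLE_RE.match(line)
        if m:
            triples.append(m.groups())
    return triples
```

Feed it the body of an HTTP response from a Bio2RDF URI or service and you get tuples ready to load into whatever local structure your mashup uses.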
To explore the possibilities offered by a triplestore, discover the Bio2RDF SPARQL endpoint about bioinformatics databases at http://namespace.bio2rdf.org/fct, and submit SPARQL queries to its endpoint at http://namespace.bio2rdf.org/sparql. And, if you are a SOAP services user, consume its web services, described at http://namespace.bio2rdf.org/bio2rdf/SOAP/services.wsdl.
Discussion
Combining data from different sources is the main problem of data integration in bioinformatics. The Semantic Web community has addressed this problem for years, and the emerging Semantic Web technologies are now mature and ready to be used in production-scale systems. The Bio2RDF community believes that the data integration problem in bioinformatics can be solved by applying existing Semantic Web practices. The bioinformatics community could significantly benefit from what is being developed now; in fact, our community has done a lot to show that the Semantic Web model has great potential for solving life-science problems. Using the RelFinder tool [15], it is even possible to query RDF graphically and visualise a triplestore's graph. By sharing our own Bio2RDF experience and the simple lessons we have learned, we hope that you will give these technologies a try in your next data integration project.
Acknowledgements
● Bio2RDF is a community project available at http://bio2rdf.org
● The community can be joined at https://groups.google.com/forum/?fromgroups#!forum/bio2rdf
● This work was done under the supervision of Dr Arnaud Droit, assistant professor and director of the Centre de Biologie Computationnelle du CRCHUQ at Laval University, where a mirror of Bio2RDF is hosted.
● Michel Dumontier, from the Dumontier Lab at Carleton University, is also hosting a Bio2RDF server and currently leads the project.
● Thanks to all the members of the Bio2RDF community, and especially Marc-Alexandre Nolin and Peter Ansell, the initial developers.

References
1) http://www.biopax.org/
2) http://www.biomart.org/
3) http://www.bio2rdf.org/
4) http://biolod.org/
5) http://linkedlifedata.com/
6) http://richard.cyganiak.de/2007/10/lod/
7) http://talend.com/
8) http://namespace.bio2rdf.org/sparql
9) http://www.w3.org/DesignIssues/LinkedData.html
10) http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/
11) http://www.openrdf.org/
12) http://4store.org/
13) http://www.mulgara.org/
14) http://sadiframework.org
15) http://www.visualdataweb.org/relfinder.php
16) http://www.soapui.org/