The Semantic Web & Web 3.0
Ready for Web 3.0?
A Practical Guide to the Semantic Web for Publishers
Introduction
The web as we know it has completely transformed the way we socialise, shop, communicate and
generally consume information. However, the full potential of the internet as a medium has yet to
be realised.
The Semantic Web is the next step (or, more accurately, leap!) in the evolution of the web. It moves
us from a position where individual documents (whether they are articles, chapters, images, raw
data sets, multimedia files, encyclopaedias etc.) exist within individual data silos to Tim Berners-
Lee's vision of an integrated web of linked data, where mash-ups, apps and context-sensitive
discovery routes are the norm.
As we enter the next phase of the Web, publishers are well placed to benefit from new
technologies to add value to content assets, increase discoverability, cross-promote products and
ultimately remain competitive in terms of attracting authors and customers.

"The most exciting thing about the Semantic Web is not what we can imagine doing with it,
but what we can't yet imagine it will do"
- Sir Tim Berners-Lee
Does your publishing strategy consider the Web 3.0 needs of your content and
your audience? This document cuts through the hype to provide a high-level practical guide to the
buzz words, opportunities and challenges, plus examples of real-life semantic web applications.
What is the Semantic Web?
…an extension of the World Wide Web that provides an easier way to find, share and combine
information from disparate sources. In the simplest terms, it is the relationship between things,
described in a way that can be understood by both people and machines.
World Wide Web = Web of Documents with Limited Interoperability
Semantic Web = Web of Integrated, Linked Data
The Semantic Web is about:
Common formats for the integration and combination of data drawn from diverse
sources. RDF, OWL, SPARQL are a set of Semantic Web standards led by the W3C
(please see the glossary below for further information).
Content structured in a semantic way so that it is meaningful to computers and to
humans. Ontologies and taxonomies provide the core structure, allowing for new
ways of navigating and discovering content.
Language for representing how the data relates to real world objects – this allows a
person or a machine to understand context and provide a more relevant user
experience.
“Data on the web defined and linked in a way that it can
be used by machines not just for display purposes, but
for automation, integration and reuse of data across
various applications”
- Sir Tim Berners-Lee
Metastore
In 2005 Publishing Technology began a two-year research and development exercise to
determine the format of the future to be used as the backbone for all of our online products.
Semantic web technologies were identified as offering benefits over traditional forms of
data storage. More specifically, RDF was determined to be the optimal format to future-proof
content and to support flexible content models, i.e. thinking beyond the traditional article and
chapter containers to provide a data storage model that is truly content agnostic: one that can
easily handle journals, books, images and multimedia, but also support the most granular
concepts such as a taxonomy item, an author name or a snippet of text. The aim was to provide
a solid foundation from which publishers can experiment with semantic web technologies at
their own pace.
Metastore was built using the Jena Semantic Web Framework, open source technology
originally created by Hewlett Packard Labs. Publishing Technology won the "Best Applications"
paper award at the Hewlett Packard Labs conference in 2006 for the initial prototype,
groundbreaking at the time as the largest commercial RDF triplestore. Metastore is now in live
production and holds content from 270 publishers via the IngentaConnect and pub2web
services. There is no requirement for publishers to change their workflow to benefit: Publishing
Technology accepts a wide range of data inputs (PDF, XML, typesetting files, hardcopy) and
converts content into RDF as part of the standard service, providing a solid foundation for
publishers to experiment without making changes to their existing workflow.
Breathing Space: http://respiratory.publishingtechnology.com
Breathing Space is the first semantic web collaboration between scholarly societies and aims to
explore the value to researchers of compiling and mining a critical mass of data within a discipline;
the project collates content from the European Respiratory Society and American Thoracic Society
who together account for 30% of citations within the field of respiratory medicine. Publishing
Technology’s Metastore and pub2web technologies are combined with open source data mining
software (Whatizit provided by the European Bioinformatics Institute) to showcase a number of
semantic web features and functionality including:
Faceted search & browse, allowing users to explore the content by Disease, Gene
Ontology Terms, Species, Authors and Images and refine search results by their
preferred facet.
Tag Clouds.
Disease, Gene Ontology & Author homepages. Automatically derived from the
publisher’s content and combined with additional information from trusted, external
open data sets (Bio2RDF, DBPedia, LinkedCT.org)
Enriched user experience, presenting relevant information from disparate sites in a
single interface.
Visualization of related concepts offers a new way to explore the content and follow
through lines of research.
"This project is helping us to explore our members' expectations and information habits,
as the role of society publishers in disseminating research evolves,” says Elin Reeves, from the
European Respiratory Society. “We’re very pleased that Publishing Technology have joined the
project as technology partner; their leadership in the deployment of semantic technologies within
highly functional publication websites guarantees us the robust, innovative platform we need to
support the project.”
TBI Communications is coordinating the project and will be presenting the results of the
experiment at the Society for Scholarly Publishing Annual Meeting in June 2010. The results will
focus specifically on feedback from authors, researchers and the editorial board.
What does it mean for me?
The semantic web has been overhyped but is now gaining momentum. It is increasingly being used
within the publishing community, by governments for the management of data, and by news
services to provide a more interactive, engaging user experience. Across numerous communities it
integrates information from a variety of sources to present relevant information to the reader in a
single location, reducing the number of clicks needed to get to what you need.
For publishers, a broad understanding of the foundation technologies, as ways for readers to
manage the ever-increasing volumes of information and content, is critical. How do you ensure
your content is visible, discoverable and linked? How can you structure your workflow and content
to benefit from the semantic web? Once the foundation technology is in place the possibilities are
endless. Now is the time to experiment and explore what your audience expects from your
content.
The volume of information available on the web has exploded in recent years and it is becoming
increasingly difficult to quickly find relevant and trusted information amongst the “noise” on the
web. Publisher, society and journal brands are key markers of quality.
Publishers have a huge amount of specialist information at their fingertips in terms of authors and
in-house experts. The challenge is how can this expertise and knowledge be efficiently harnessed
to enrich content and online products?
Workflows can be adapted to capture additional information as part of the publishing process but
this takes time to implement and encourage. Automated data mining can offer a compromise for
semantically enriching your content now. Data mining allows concepts, facts and other relevant
information to be extracted from content. Once extracted, the concepts or facts can be used to
drive new discovery routes to relevant content, and new products can be created on the fly,
giving publishers a competitive edge.
Semantic web technologies can also drive a more "lively" and interactive user experience: by
integrating external data sets to create a more valuable product, by displaying supplementary
data in many forms (a YouTube clip, raw statistics, audio files, images, podcasts etc.), and by
enabling greater interaction with the content.
Opportunities for publishers:
Opportunities differ depending on the subject area.
Start simple and add functionality incrementally – experiment!
Utilize in-house experts and author knowledge to enrich content.
Enhance discoverability and visibility of your content.
Cross marketing benefit of offering new routes to your content.
Relevancy and precision of search results encourages return visits to your site.
Gain competitive edge.
Challenges to consider:
Adapting workflow: how can additional information be captured as part of the
standard workflow?
Balancing automated semantic enrichment vs. manual specialist input.
Authority, trust, provenance of open data sets.
Setting clear terms of reuse and attribution (Creative Commons suggested).
Buzz Words: A Glossary
This glossary is not intended to be comprehensive; rather, it provides a snapshot of the key terms
used when discussing the Semantic Web. The World Wide Web Consortium (W3C) is an invaluable
resource for further information on the Semantic Web and associated standards
(http://www.w3.org/standards/semanticweb/).
Linked Data
“…a term used to describe a recommended best practice for exposing, sharing and connecting
pieces of data, information, and knowledge on the Semantic Web using URIs and RDF” Wikipedia
For example, DBPedia is the semantic web (or RDF) version of Wikipedia. By creating this
alternative view of Wikipedia using semantic web technologies other content creators and sites
can automatically harvest data from DBPedia for inclusion within their own sites. This may be
useful in the context of providing further background on a topic, integrating definitions etc.
alongside published content. The LinkingOpenData Project’s goal is to:
“extend the Web with a data commons by publishing various open data sets as RDF on the Web
and by setting RDF links between data items from different data sources. RDF links enable you to
navigate from a data item within one data source to related data items within other sources.
RDF links can also be followed by the crawlers of Semantic Web search engines, which may
provide sophisticated search and query capabilities over crawled data. The figures below show
the data sets that have been published and interlinked by the project so far. Collectively, the
data sets consist of over 13.1 billion RDF triples, which are interlinked by around 142 million RDF
links (2009). “
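To make the idea of an RDF link concrete, here is a minimal Turtle sketch of a record in a hypothetical publisher dataset linked to the corresponding DBPedia resource. The URIs below are illustrative assumptions, not identifiers from any real dataset in the project:

```
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# A subject entry in a (hypothetical) publisher dataset...
<http://data.example-publisher.com/subject/asthma>
    dcterms:title "Asthma" ;
    # ...linked to the equivalent resource in DBPedia, letting
    # crawlers and applications hop between the two datasets.
    owl:sameAs <http://dbpedia.org/resource/Asthma> .
```

By following the owl:sameAs link, an application can pull in DBPedia's description of the same concept and display it alongside the publisher's own content.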
The Linking OpenData Project Cloud Diagram below gives a flavour of the data sets currently
available.
Linking Open Data (LOD) Project Cloud Diagram, http://linkeddata.org/
Mash up
"…a web application that combines data from more than one source into a single integrated tool,
thereby creating a new and distinct Web service that was not originally provided by either source"
Wikipedia
Examples include:
WikiFM, a mashup between Last.FM and Wikipedia, allowing you to view artist and
song information from Wikipedia while listening to Last.FM radio.
Integrating field study data with Google Maps to allow readers to focus in on
geographical areas of interest and date range as well as providing ease of integration
of additional data sets which may result in new discoveries and lines of research.
RDF (Resource Description Framework)
A framework for expressing data as subject – predicate – object triples. For example:
Charles Dickens (subject) – wrote (predicate) – Pickwick Papers (object)

Relationships with other RDF statements can therefore be inferred: what else did Charles
Dickens write? Which critics have discussed his work? What topics are covered within the
Pickwick Papers? And so on, until a web of relationships can be automatically inferred.
RDF is designed to describe data in a distributed world (and is therefore the standard format for
linked data). Everything expressed in RDF means something, whether a reference to a physical
object, an abstract concept, or a fact. Standards built on RDF describe logical inferences between
facts.
By storing data in RDF it can be more easily integrated with other sets of data to create entirely
new datasets and visualizations.
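The question "what else did Charles Dickens write?" can be answered mechanically once statements are held as triples. The sketch below, in plain Python, is a toy illustration of triple pattern matching (not how a production triplestore such as Metastore is implemented); the data and predicate names are invented for the example:

```python
# A toy triplestore: each statement is a (subject, predicate, object) tuple.
triples = [
    ("Charles Dickens", "wrote", "Pickwick Papers"),
    ("Charles Dickens", "wrote", "Great Expectations"),
    ("G. K. Chesterton", "discussed", "Charles Dickens"),
]

def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# What else did Charles Dickens write?
works = [o for (_, _, o) in match("Charles Dickens", "wrote", None)]
print(works)  # ['Pickwick Papers', 'Great Expectations']
```

Real triplestores index triples by subject, predicate and object so that the same wildcard patterns can be answered efficiently over billions of statements.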
SPARQL
The query language for the Semantic Web.
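As a flavour of the language, the sketch below asks the Charles Dickens question from the RDF entry above. The ex: vocabulary is a hypothetical placeholder, not a real published schema:

```
PREFIX ex: <http://example.org/terms/>

# What else did Charles Dickens write?
SELECT ?work
WHERE {
  ?author ex:name  "Charles Dickens" .
  ?author ex:wrote ?work .
}
```

A SPARQL endpoint evaluates the graph pattern in the WHERE clause against the stored triples and returns one row per matching binding of ?work.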
Web 3.0
Increasingly being used as an alternative name for the Semantic Web, but not an official term. The
Semantic Web is often described as being a component of Web 3.0:
Tim Berners-Lee: "People keep asking what Web 3.0 is. I think maybe when you've got an
overlay of scalable vector graphics – everything rippling and folding and looking misty – on
Web 2.0 and access to a semantic Web integrated across a huge space of data, you'll have
access to an unbelievable data resource."

Conrad Wolfram: "Web 3.0 is where the computer is generating new information, rather
than humans."
Events: Want to hear more about the Semantic Web?
Event: London Book Fair Supply Chain Seminar
Date: 21st April 2010, London
Title: Semantic Web Technologies and Book publishing
Speaker: Priya Parvatikar (Technical Architect, Publishing Technology)
URL: http://www.londonbookfair.co.uk/page.cfm/Action=Seminars/SeminarID=56
Event: Society for Scholarly Publishing Annual Meeting
Date: 3rd-4th June 2010, San Francisco
Presentation: Semantic societies: uncovering new research paths, engaging members better
Speaker: Charlie Rapple (Head of Marketing Development, TBI Communications)
URL: http://www.sspnet.org
Event: “Ready for Web 3.0?” ALPSP seminar, chaired by Louise Tutton (COO, Publishing
Technology)
Date: 1st July 2010, London
Presenters include: Priya Parvatikar (Publishing Technology), Leigh Dodds (Talis), Outsell,
Biochemical Society, Southampton University and Charlesworth.
URL: http://www.alpsp.org/
Contact:
For more information about Publishing Technology, its products and services,
please visit www.publishingtechnology.com or contact:
Emily Taylor
Publishing Technology plc
Tel: +44 1865 397873
Email: emily.taylor@publishingtechnology.com
About Publishing Technology plc
Publishing Technology’s brands include advance, IngentaConnect, VISTA, author2reader,
pub2web, ICS and PCG. The Publishing Technology group enables publishers to focus on their
core competences by providing a single, trusted partner for both technology requirements and
business development services. It is one of the largest providers of software and services for
the publishing industry, servicing eight out of ten of the world's largest publishers. The group’s
proposition uniquely spans front and back office systems – complemented by a range of
business development services – to provide the industry's only end-to-end suite of software
specifically designed to support the publishing process. Capabilities cover editorial &
production, product information, billing & fulfilment, content conversion & hosting, website
development, marketing programs, information commerce, customer relationship
management, rights & royalties and business intelligence. All application modules can be
configured independently to meet specific publishers' needs and to allow flexible integration
with existing systems. Associated sales and marketing services include consultancy and
research, sales representation and telemarketing. The company is listed on the AIM market of
the London Stock Exchange and has offices in Europe, North America and Australia.
Related Links
Visit us at www.publishingtechnology.com, follow on Twitter @publishingtech, or connect with
us on LinkedIn.