POLITECNICO DI TORINO
SCUOLA DI DOTTORATO
Ph.D. Programme in Computer and Automation Engineering (Ingegneria Informatica e dell'automazione) – XVIII cycle
Doctoral Thesis
Architectures and Algorithms for
Intelligent Web Applications
How to bring more intelligence to the web and beyond
Dario Bonino
Advisor: prof. Fulvio Corno
Ph.D. Programme Coordinator: prof. Pietro Laface
December 2005
Acknowledgements
Many people deserve grateful acknowledgment for their role in supporting me
during these long and exciting three years. First of all I would like to thank my adviser, Fulvio Corno,
who always supported and guided me toward the best decisions and solutions.
Together with Fulvio I also want to thank Laura Farinetti very much: she was there
any time I needed her help, for both insightful discussions and silly questions.
Thanks to all my colleagues in the e-Lite research group for their constant support,
for their kindness and for their ability to ignore my bad moments. Particular thanks
go to Alessio for being not only my best colleague and competitor, but also one of
the best friends I have ever had. The same goes for Paolo: our railway discussions have
been so interesting and useful!
Thanks to Mike, who introduced me to many Linux secrets, to Franco for being
the calmest person I have ever known, and to Alessandro "the Eye tracker", the best
surfer I have ever met.
Thanks to my parents Ercolino and Laura and to my sister Serena; they have
always been my springboard and my unbreakable backbone.
Thank you to all the people I have met in these years who are not cited here:
I am very glad to have been with you, even if only for a few moments. Thank you!
Contents
Acknowledgements I
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 The Semantic Web vision 8
2.1 Semantic Web Technologies . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Explicit Metadata . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 Ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Semantic Web Languages . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 RDF and RDF-Schema . . . . . . . . . . . . . . . . . . . . . . 15
2.3.2 OWL languages . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.3 OWL in a nutshell . . . . . . . . . . . . . . . . . . . . . . . . 20
3 Web applications for Information Management 25
3.1 The knowledge sharing and life cycle model . . . . . . . . . . . . . . 26
3.2 Software tools for knowledge management . . . . . . . . . . . . . . . 27
3.2.1 Content Management Systems (CMS) . . . . . . . . . . . . . . 28
3.2.2 Information Retrieval systems . . . . . . . . . . . . . . . . . . 34
3.2.3 e-Learning systems . . . . . . . . . . . . . . . . . . . . . . . . 41
4 Requirements for Semantic Web Applications 49
4.1 Functional requirements . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Use Cases for Functional Requirements . . . . . . . . . . . . . . . . . 53
4.2.1 The “Semantic what’s related” . . . . . . . . . . . . . . . . . 53
4.2.2 The “Directory search” . . . . . . . . . . . . . . . . . . . . . . 53
4.2.3 The “Semi-automatic classification” . . . . . . . . . . . . . . . 54
4.3 Non-functional requirements . . . . . . . . . . . . . . . . . . . . . . . 56
5 The H-DOSE platform: logical architecture 59
5.1 The basic components of the H-DOSE semantic platform . . . . . . . 60
5.1.1 Ontology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.1.2 Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.1.3 Annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 Principles of Semantic Resource Retrieval . . . . . . . . . . . . . . . . 69
5.2.1 Searching for instances . . . . . . . . . . . . . . . . . . . . . . 69
5.2.2 Dealing with annotations . . . . . . . . . . . . . . . . . . . . . 70
5.2.3 Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2.4 Searching by conceptual spectra . . . . . . . . . . . . . . . . . 73
5.3 Bridging the gap between syntax and semantics . . . . . . . . . . . . 74
5.3.1 Focus-based synset expansion . . . . . . . . . . . . . . . . . . 75
5.3.2 Statistical integration . . . . . . . . . . . . . . . . . . . . . . . 78
5.4 Experimental evidence . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4.1 Multilingual approach results . . . . . . . . . . . . . . . . . . 79
5.4.2 Conceptual spectra experiments . . . . . . . . . . . . . . . . . 81
5.4.3 Automatic learning of text-to-concept mappings . . . . . . . . 83
6 The H-DOSE platform 85
6.1 A layered view of H-DOSE . . . . . . . . . . . . . . . . . . . . . . . . 86
6.1.1 Service Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.1.2 Kernel Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.1.3 Data-access layer . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.1.4 Management and maintenance sub-system . . . . . . . . . . . 98
6.2 Application scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.2.1 Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.2.2 Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.3 Implementation issues . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7 Case studies 107
7.1 The Passepartout case study . . . . . . . . . . . . . . . . . . . . . . . 107
7.1.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.2 The Moodle case study . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.3 The CABLE case study . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.3.1 System architecture . . . . . . . . . . . . . . . . . . . . . . . . 114
7.3.2 mH-DOSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.4 The Shortbread case study . . . . . . . . . . . . . . . . . . . . . . . . 119
7.4.1 System architecture . . . . . . . . . . . . . . . . . . . . . . . . 120
7.4.2 Typical Operation Scenario . . . . . . . . . . . . . . . . . . . 122
7.4.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 123
8 H-DOSE related tools and utilities 125
8.1 Genetic refinement of semantic annotations . . . . . . . . . . . . . . . 126
8.1.1 Semantics powered annotation refinement . . . . . . . . . . . 128
8.1.2 Evolutionary refiner . . . . . . . . . . . . . . . . . . . . . . . . 131
8.2 OntoSphere . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
8.2.1 Proposed approach . . . . . . . . . . . . . . . . . . . . . . . . 141
8.2.2 Implementation and preliminary results . . . . . . . . . . . . . 145
9 Semantics beyond the Web 149
9.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
9.1.1 Application Interfaces, Hardware and Appliances . . . . . . . 152
9.1.2 Device Drivers . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
9.1.3 Communication Layer . . . . . . . . . . . . . . . . . . . . . . 154
9.1.4 Event Handler . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
9.1.5 House Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
9.1.6 Domotic Intelligence System . . . . . . . . . . . . . . . . . . . 156
9.1.7 Event Logger . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
9.1.8 Rule Miner . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
9.1.9 Run-Time Engine . . . . . . . . . . . . . . . . . . . . . . . . . 157
9.1.10 User Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . 157
9.2 Testing environment . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
9.2.1 BTicino MyHome System . . . . . . . . . . . . . . . . . . . . 158
9.2.2 Parallel port and LEDs . . . . . . . . . . . . . . . . . . . . . . 158
9.2.3 Music Server (MServ) . . . . . . . . . . . . . . . . . . . . . . 159
9.2.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 159
9.3 Preliminary Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
10 Related Works 161
10.1 Automatic annotation . . . . . . . . . . . . . . . . . . . . . . . . . . 162
10.2 Multilingual issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
10.2.1 Term-related issues . . . . . . . . . . . . . . . . . . . . . . . . 164
10.3 Semantic search and retrieval . . . . . . . . . . . . . . . . . . . . . . 165
10.4 Ontology visualization . . . . . . . . . . . . . . . . . . . . . . . . . . 166
10.5 Domotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
11 Conclusions and Future works 172
11.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
11.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Bibliography 178
A Publications 186
Chapter 1
Introduction
The Semantic Web will be an extension of the current Web in which
information is given well-defined meaning, better enabling computers
and people to work in cooperation.
Tim Berners-Lee
The new generation of the Web will enable humans to gain wisdom of
living, working, playing, and learning, in addition to information search
and knowledge queries.
Ning Zhong
1.1 Motivation
Over the last decade the World Wide Web has gained great momentum, rapidly
becoming a fundamental part of our everyday life. In personal communication, as well
as in business, the impact of the global network has completely changed the way
people interact with each other and with machines. This revolution touches all
aspects of people's lives and is gradually pushing the world toward a "Knowledge
society" where the most valuable resources will no longer be material but informational.
The way we think of computers has also been influenced by this development:
we are in fact evolving from thinking of computers as "calculus engines" to consid-
ering them as "gateways", or entry points, to the newly available information
highways.
The popularity of the Web has led to the exponential growth of published pages
and services observed in recent years. Companies now offer web pages to adver-
tise and sell their products. Learning institutions present teaching material
and on-line training facilities. Governments provide web-accessible administrative
services to ease citizens' lives. Users build up communities to exchange any kind
of information and/or to form more powerful market actors able to survive in this
global ecosystem.
This stunning success is also the curse of the current web: most of today's Web
content is only suitable for human consumption, but the huge amount of available
information makes it increasingly difficult for users to find and access the informa-
tion they require. Under these conditions, keyword-based search engines, such as AltaVista,
Yahoo, and Google, are the main tools for using today's Web. However, there are
serious problems associated with their use:
• High recall, low precision: relevant pages are buried among thousands
of pages of little interest.
• Low recall: although rarer, it sometimes happens that queries get no answer
because they are formulated with the wrong words.
• Results are sensitive to vocabularies: for example, the adoption of
different synonyms of the same keyword may lead to different results.
• Searches are for documents and not for information.
Even if the search process is successful, the result is only a "relevant" set of web
pages that the user must scan to find the required information. In a sense, the
term used to classify the involved technologies, Information Retrieval, is in this case
rather misleading, and Location Retrieval might be better.
The critical point is that, at present, machines are not able, without heuristics
and tricks, to understand documents published on the web and to extract only the
relevant information from pages. Of course there are tools that can retrieve text,
split phrases, count words, etc. But when it comes to interpreting and extracting
useful data for users, the capabilities of current software are still limited.
One solution to this problem consists in keeping information as it currently
is and developing sophisticated tools that use artificial intelligence techniques
and computational linguistics to "understand" what is written in web pages. This
approach has been pursued for a while but, at present, still appears too ambitious.
Another approach is to define the web in a more machine-understandable fashion
and to use intelligent techniques to take advantage of this representation.
This plan of revolutionizing the web is usually referred to as the Semantic Web
initiative, and it is only a single aspect of the next evolution of the web, the Wisdom
Web.
It is important to notice that the Semantic Web does not aim at being parallel to the
World Wide Web; instead it aims at evolving the Web into a new knowledge-centric,
global network. Such a new network will be populated by intelligent web agents able
to act on behalf of their human counterparts, taking into account the semantics
(meaning) of information. Users will once more be the center of the Web: they will
be able to communicate and to use information through a more human-like interaction,
and they will also be provided with ubiquitous access to such information.
1.2 Domain: the long road from today’s Web to
the Wisdom Web
Starting from the current web, the ongoing evolution aims at transforming to-
day's syntactic World Wide Web into the future Wisdom Web. The fundamental
capabilities of this new network include:
• Autonomic Web support, allowing the design of self-regulating systems able to
cooperate autonomously with other available applications and information sources.
• Problem Solving capabilities, for specifying, identifying and solving roles,
settings and relationships between services.
• Semantics, ensuring the “right” understanding of involved concepts and the
right context for service interactions.
• Meta-Knowledge, for defining and/or addressing the spatial and temporal
constraints or conflicts in planning and executing services.
• Planning, for enabling services or agents to autonomously reach their goals
and subgoals.
• Personalization, for understanding recent encounters and for relating different
episodes together.
• A sense of humor, so services on the Wisdom Web will be able to interact
with users on a personal level.
These capabilities stem from several active research fields, including the Multi-
Agent, Semantic Web and Ubiquitous Computing communities, and they draw on
technologies developed for databases, computational grids, social networks, etc.
The Wisdom Web, in a sense, is the place where most of the currently available
technologies and their evolutions will join in a single scenario with a "devastating"
impact on human society.
Many steps, however, still separate us from this future web, the Semantic Web
initiative being the first serious and world-wide attempt to build the necessary in-
frastructure.
Semantics is one of the cornerstones of the Wisdom Web. It is founded on the
formal definition of the "concepts" involved in web pages and in services available
on the web. Such a formal definition needs two critical components: knowledge
models and associations between knowledge and resources. The former are known
as Ontologies while the latter are usually referred to as Semantic Annotations.
There are several issues related to the introduction of semantics on the web: how
should knowledge be modeled? How can knowledge be associated with real-world
entities or with real-world information?
The Semantic Web initiative builds on the evidence that creating a common,
monolithic, omni-comprehensive knowledge model is infeasible and, instead, assumes
that each actor playing a role on the web shall be able to define its own model
according to its own view of the world. In the SW vision, the global knowledge
model is the result of a shared effort in building and linking the single models
developed all around the world, much as happened for the current Web.
Of course, in such a process conflicts will arise, which will be solved by proper tools
for mapping "conceptually similar" entities in the different models.
The definition of knowledge models alone does not introduce semantics on the
web; in order to get such a result another step is needed: resources must be
"associated" with the models. The current trend is to perform this association by
means of Semantic Annotations. A Semantic Annotation is basically a link between
a resource on the web, be it a web page, a video or a music piece, and one or
more concepts defined by an ontology. So, for example, the pages of a news site can
be associated with the concept news in a given ontology by means of an annotation,
in the form of a triple: about(site, news).
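Purely as an illustrative sketch (the namespace, site URL and property name below are hypothetical and not part of any vocabulary defined in this thesis), such an annotation could be serialized in RDF/XML as:
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:onto="http://www.example.org/news-ontology#">
<rdf:Description rdf:about="http://www.example-news.net/index.html">
<onto:about rdf:resource="http://www.example.org/news-ontology#news"/>
</rdf:Description>
</rdf:RDF>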
Semantic annotations are not only for documents or web pages: they can be used
to associate semantics with nearly all kinds of informative, physical and meta-physical
entities in the world.
Many issues are involved in the annotation process. Just to mention some
of them: at which granularity shall these annotations be defined? How can we decide
whether a given annotation can be trusted? Who shall annotate resources: the
creator, or anyone on the web?
To answer these and the previous questions, standard languages and practices
for semantic markup are needed, together with a formal logic for reasoning about
the available knowledge and for turning implicit information into explicit facts.
Around them, database engines for storing semantically rich data and search engines
offering new question-answering interfaces constitute the IT backbone of
the new information highway so defined.
Other, more "technical" pillars of the forthcoming Wisdom Web include
autonomic systems, planning and problem solving, etc. For them, great im-
provements are currently being provided by the ever-active community of Intelligent
and Multi-Agent systems. In this case the problems involved are slightly differ-
ent from the ones cited above, although semantic integration can be relevant in
many respects. The main research stream is in fact about designing machines able
to think and act rationally. In simple terms, the main concern in this field is to
define systems able to autonomously pursue goals and subgoals defined either by
humans or by other systems.
Meta-knowledge plays a crucial role in such a process, providing means for mod-
eling spatial and temporal constraints, or conflicts, that may arise between agents'
goals, and it can be, in turn, strongly based on semantics. Knowledge modeling can, in
fact, support the definition and discovery of similarities and relationships between
constraints, in a way that is independent of the dialects with which each single
agent composes its world understanding.
Personalization, finally, is not a new discipline: it is historically concerned
with interaction modalities between users and machines, defining methodologies and
instruments for designing usable interfaces for all people, be they "typical" users or
differently abled people. It has impact on, or interacts with, many research fields,
starting from Human Computer Interaction and encompassing Statistical Analysis
of user preferences and Prediction Systems. Personalization is a key factor for
Wisdom Web exploitation: until usable and efficient interfaces are available,
this new web, in which the available information will be many orders of magnitude
wider than in the current one, will not be adopted.
The good news is that all the cited issues can be solved without requiring
revolutionary scientific progress. We can in fact reasonably claim that the challenge
lies only in engineering and technological adoption, as partial solutions to all the
relevant parts of the scenario already exist. At present, the greatest needs
appear to be in the areas of integration, standardization, development of tools and
adoption by users.
1.3 Contribution
In this Thesis, methodologies and techniques for paving the way that starts from
today's web applications and leads to the Wisdom Web have been studied, with
a particular focus on information retrieval systems, content management systems
and e-Learning systems. A new platform for supporting the easy integration of
semantic information into today's systems has been designed and developed, and
has been applied to several case studies: a custom-made CMS, a publicly available
e-Learning system (Moodle [1]), an intelligent proxy for web navigation (Muffin [2])
and a life-long learning system developed in the context of the CABLE project [3]
(an EU-funded Minerva project).
In addition, extensions of the proposed system to environments that share
with the Web its underlying infrastructure and its communication and interaction
paradigms have been studied; a case study is provided for domotic systems.
Several contributions to the state of the art in semantic systems can be found in the
components of the platform, including: an extension of the T.R. Gruber ontology
definition that allows multilingual knowledge domains to be supported transparently, a
new annotation "expansion" system that leverages the information encoded
in ontologies to extend semantic annotations, and a new "conceptual" search
paradigm based on a compact representation of semantic annotations called the Con-
ceptual Spectrum. The semantic platform discussed in this thesis is named H-DOSE
(Holistic Distributed Semantic Elaboration Platform) and is currently available as
an Open Source project on Sourceforge: http://dose.sourceforge.net.
H-DOSE has been entirely developed in Java to allow better interoperability
with existing web systems and is currently deployed as a set of web services
running on the Apache Tomcat servlet container. It is available in two
different forms: one intended for micro enterprises, characterized by a small footprint
on the server on which it runs, and one for small and medium enterprises, which
adds the ability to distribute jobs across different machines by means of agents
and includes principles of autonomic computing to keep the underlying
knowledge base constantly up-to-date. Rather than being an isolated attempt at
integrating semantics into the current web, H-DOSE is still a very active project and is
undergoing several improvements and refinements to better support the indexing
and retrieval of non-textual information such as video clips, audio pieces, etc. There
is also ongoing work on the integration of H-DOSE into competitive intelligence
systems, as done by IntelliSemantic: a start-up of the Politecnico di Torino that builds
its business plan on the adoption of semantic techniques, and in particular of the
H-DOSE platform, for patent discovery services.
Finally, several side issues related to handling and deploying semantics in
web applications have been addressed during the H-DOSE design; some of them
are also presented in this thesis. A newly designed ontology visualization tool based
on multi-dimensional information spaces is an example.
1.4 Structure of the Thesis
The remainder of this thesis is organized as follows:
Chapter 2 introduces the Semantic Web vision and discusses the data model,
standards, and technologies used to bring this vision into being. These building
blocks are used in the design of H-DOSE, trying to maximize the reuse of already
available and well-tested technologies and thus avoiding reinventing the wheel.
Chapter 3 moves in parallel with the preceding chapter, introducing an overview
of currently available web applications, with a particular focus on systems for infor-
mation management such as Content Management Systems, indexing and retrieval
systems, and e-Learning systems. For every category of application, the points in which
semantics can give substantial improvements either in effectiveness (performance)
or in user experience are highlighted.
Chapter 4 defines the requirements for the H-DOSE semantic platform, as they
emerge from interviews with web actors such as content publishers, site administra-
tors and so on.
Chapter 5 introduces the H-DOSE logical architecture and uses it as a guide
for discussing the basic principles and assumptions on which the platform is built.
For every innovative principle the strong points are highlighted, together with the
weaknesses that emerged either during the presentation of such elements at
international conferences and workshops or during the H-DOSE design and
development process.
Chapter 6 describes the H-DOSE platform in deep detail, focusing on the role
of, and the interactions that involve, every single component of the platform. The main
concern of this chapter is to provide a complete view of the platform, in its more
specific aspects, discussing the adopted solutions from a "software engineering" point
of view.
Chapter 7 presents the case studies that constituted the benchmark of the
H-DOSE platform. Each case study is addressed separately, starting from a brief
description of requirements and going through the integration design process, the
deployment of the H-DOSE platform and the phase of results gathering and analysis.
Chapter 8 is about the H-DOSE-related tools developed during the platform
design and implementation. They include a new ontology visualization tool and a
genetic algorithm for semantic annotation refinement.
Chapter 9 discusses the extension of H-DOSE principles and techniques to
non-Web scenarios, with a particular focus on domotics. An ongoing project on
semantics-rich house gateways is described, highlighting how the lessons learned
in the design and development of H-DOSE can be applied in a completely different
scenario while retaining their value.
Chapter 10 presents related work in the fields of both the Semantic Web and
Web Intelligence, with a particular focus on semantic platforms and semantics inte-
gration on the Web.
Chapter 11 finally concludes the thesis and provides an overview of possible
future work.
Chapter 2
The Semantic Web vision
This chapter introduces the Semantic Web vision and discusses the
data model, standards, and technologies used to bring this vision into
being. These building blocks are used in the design of H-DOSE, trying to
maximize the reuse of already available and well-tested technologies and
thus avoiding reinventing the wheel.
The Semantic Web is developed layer by layer; the pragmatic justification for such
a procedure is that it is easier to achieve consensus on small steps, whereas it is
much harder to make everyone agree on very wide proposals. In fact there are many
research groups exploring different and sometimes conflicting solutions.
After all, competition is one of the major driving forces of scientific development.
Such competition makes it very hard to reach agreement on wide steps, and often
only a partial consensus can be achieved. The Semantic Web builds upon the steps
for which consensus can be reached, instead of waiting to see which alternative
research line will be successful in the end.
The Semantic Web is such that companies, research groups and users must build
tools, add content and use that content. It is certainly myopic to wait until the full
vision materializes: it may take another ten years to realize the full extent of the
SW, and many more years for the Wisdom Web.
In evolving from one layer to another, two principles are usually followed:
• Downward compatibility: applications, or agents, fully compliant with a
layer shall also be aware of the lower layers, i.e., they shall be able to interpret
and use information coming from those layers. As an example we can consider
an application able to understand the OWL semantics. The same application
shall also take full advantage of information encoded in RDF and RDF-S [4].
• Partial upward understanding: agents fully aware of a given layer should
take at least partial advantage of information at higher levels. So, an RDF-
aware agent should also be able to use information encoded in OWL [5], ig-
noring those elements that go beyond RDF and RDF Schema.
Figure 2.1. The Semantic Web "cake".
The layered cake of the Semantic Web, due to Tim Berners-Lee, is shown in Figure 2.1
and describes the main components involved in the realization of the Semantic Web
vision. At the bottom lies XML (eXtensible Markup Language), a language for writing
well-structured documents according to a user-defined vocabulary. XML is a
"de facto" standard for the exchange of information over the World Wide Web.
On top of XML builds the RDF layer.
RDF is a simple data model for writing statements about Web objects. RDF is
not XML; however, it has an XML-based syntax, so it is located above the XML
layer in the cake.
RDF-Schema defines the vocabulary used in RDF data models. It can be seen
as a very primitive language for defining ontologies, as it provides the basic building
blocks for organizing Web objects into hierarchies. Supported constructs include:
classes and properties, the subClass and subProperty relations, and domain and
range restrictions. RDF-Schema uses an RDF syntax.
The Logic layer is used to further enhance the ontology support offered by
RDF-Schema, thus allowing application-specific declarative knowledge to be modeled.
The Proof layer, instead, involves the process of deductive reasoning as well as
the process of providing and representing proofs in Web languages. Applications
lying at the proof level shall be able to reason about the knowledge defined in
the lower layers and to provide conclusions together with "explanations" (proofs)
of the deductive process leading to them.
The Trust layer, finally, will emerge through the adoption of digital signatures
and other kinds of knowledge, based on recommendations by trusted agents, by
rating and certification agencies or even by consumer organizations. The expression
"Web of Trust" means that trust over the Web will be organized in the same
distributed and sometimes chaotic way as the WWW itself. Trust is crucial for the
final exploitation of the Semantic Web vision: until users have trust in its
operations (security) and in the quality of the information provided (relevance), the SW will
not reach its full potential.
2.1 Semantic Web Technologies
The Semantic Web cake depicted above builds upon the so-called Semantic Web
Technologies. These technologies empower the foundational components of the SW,
which are introduced separately in the following subsections.
2.1.1 Explicit Metadata
At present, the World Wide Web is mainly formatted for human users rather than
for programs. Pages, either static or dynamically built using information stored
in databases, are written in HTML or XHTML. A typical web page of an ICT
consultancy agency might look like this:
<html>
<head></head>
<body>
<h1> SpiderNet internet consultancy,
network applications and more </h1>
<p> Welcome to the SpiderNet web site, we offer
a wide variety of ICT services related to the net.
<br/> Adam Jenkins, our graphics designer has designed many
of the most famous web sites as you can see in
<a href="gallery.html">the gallery</a>.
Matt Kirkpatrick is our Java guru and is able to develop
any new kind of functionalities you may need.
<br> If you are seeking a great new opportunity
for your business on the web, contact us at the
following e-mails:
<ul>
<li>jenkins@spidernet.net</li>
<li>kirkpatrick@spidernet.net</li>
</ul>
Or you may visit us in the following opening hours
<ul>
<li>Mon 11am - 7pm</li>
<li>Tue 11am - 2pm</li>
<li>Wed 11am - 2pm</li>
<li>Thu 11am - 2pm</li>
<li>Fri 2pm - 9pm</li>
</ul>
Please note that we are closed every weekend and every festivity.
</p>
</body>
</html>
For people the provided information is presented in a rather satisfactory way,
but for machines this document is nearly incomprehensible. Keyword-based
techniques might be able to identify the words web site, graphics designer and Java.
An intelligent agent could identify the e-mail addresses and the personnel of the
agency and, with a little bit of heuristics, it might associate each employee with the
correct e-mail address. But it would have trouble distinguishing who is the graphics
designer and who is the Java developer, and even more difficulty in capturing the
opening hours (for which the agent would have to understand which festivities are
celebrated during the year, and on which days, depending on the location of the
agency, which in turn is not explicitly stated in the web page). The Semantic
Web tries to address these issues not by developing super-intelligent agents able to
understand information as humans do. Instead it acts on the HTML side, trying to
replace this language with more appropriate ones so that web pages can carry
their content in a machine-processable form while remaining visually appealing for
users. In addition to formatting information for human users, these new web
pages will also carry information about their content, such as:
<company type="consultancy">
<service>Web Consultancy</service>
<products> Web pages, Web applications </products>
<staff>
<graphicsDesigner>Adam Jenkins</graphicsDesigner>
<javaDeveloper>Matt Kirkpatrick</javaDeveloper>
</staff>
</company>
This representation is much easier for machines to understand and is usually
known as metadata, which means "data about data". Metadata encodes, in a sense,
the meaning of data, thus defining the semantics of a web document (hence the term
Semantic Web).
2.1.2 Ontologies
The term ontology stems from philosophy. In that context, it is used to name a
subfield of philosophy, namely the study of the nature of existence (from the Greek
ὀντολογία), the branch of metaphysics concerned with identifying, in general terms,
the kinds of things that actually exist, and how to describe them. For example, the
observation that the world is made up of specific entities that can be grouped into
abstract classes based on shared properties is a typical ontological commitment.
In the context of today's technologies, ontology has been given a specific
meaning that is quite different from the original one. For the purposes of this thesis,
T.R. Gruber's definition, later refined by R. Studer, can be adopted: an ontology
is an explicit and formal specification of a conceptualization.
In other words, an ontology formally describes a knowledge domain. Typically,
an ontology is composed of a finite list of terms and the relationships between these
terms. The terms denote important concepts (classes of objects) of the domain.
Relationships include, among others, hierarchies of classes. A hierarchy
specifies a class C to be a subclass of another class C′ if every object in C is also
included in C′. Apart from the subclass relationship (also known as the "is a" relation),
ontologies may include information such as:
• properties (X makes Y )
• value restrictions (only smiths can make iron tools)
• disjointness statements (teachers and secretary staff are disjoint)
• specification of logical relationships between objects
In the context of the web, ontologies provide a shared understanding of a domain.
Such an understanding is necessary to overcome differences in terminology. As an
example, a web application may use the term "ZIP" for the same information that
another denotes as "area code". Another problem arises when two applications
use the same term with different meanings. Such differences can be overcome by
associating a particular terminology with a shared ontology, and/or by defining
mappings between different ontologies. In both cases, it is easy to notice that
ontologies support semantic interoperability.
Ontologies are also useful for improving the results of Web searches. The search
engine can look for pages that refer to a precise concept, or set of concepts, in
an ontology instead of collecting all pages in which certain, possibly ambiguous,
keywords occur. In the same way as above, ontologies allow differences in
terminology between Web pages and queries to be overcome. In addition, when performing
ontology-based searches it is possible to exploit generalization and specialization
information. If a query fails to find any relevant documents (or provides too many
results), the search engine can suggest to the user a more general (or more specific) query
[6]. It is even conceivable that the search engine runs such queries proactively, in
order to reduce the reaction time in case the user adopts such a suggestion.
Ontologies can even be used to better organize Web sites and their navigation.
Many of today's sites offer, on the left-hand side of their pages, the top levels
of a concept hierarchy of terms. The user may click on them to expand the sub-
categories and finally reach new pages on the same site.
In the Semantic Web layered approach, ontologies are located in between the
third layer of RDF and RDF-S and the fourth level of abstraction where the Web
Ontology Language (OWL) resides.
2.2 Logic
Logic is the discipline that studies the principles of reasoning; in general, it offers
formal languages for expressing knowledge and well-understood formal semantics.
Logic usually works with so-called declarative knowledge, which describes what
holds without caring about how it can be deduced.
Deduction can be performed by automated reasoners: software entities that have
been extensively studied in Artificial Intelligence. Logical deduction (inference) allows
implicit knowledge defined in a domain model (ontology) to be transformed into explicit
knowledge. For example, if a knowledge base contains the following axioms in pred-
icate logic,
human(X) → mammal(X)
Ph.Dstudent(X) → human(X)
Ph.Dstudent(Dario)
an automated inferencing engine can easily deduce that
human(Dario)
mammal(Dario)
Ph.Dstudent(X) → mammal(X)
Logic can therefore be used to uncover ontological knowledge that is implicitly given
and, by doing so, it can help reveal unexpected relationships and inconsistencies.
But logic is more general than ontologies and can also be used by agents for
making decisions and selecting courses of action, for example.
Generally there is a trade-off between expressive power and computational ef-
ficiency. The more expressive a logic is, the more computationally expensive it
becomes to draw conclusions. And drawing conclusions can sometimes be impos-
sible when non-computability barriers are encountered. Fortunately, a considerable
part of the knowledge relevant to the Semantic Web seems to be of a relatively re-
stricted form, and the required subset of logic is largely tractable and supported
by efficient reasoning tools.
Another important aspect of logic, especially in the context of the Semantic Web,
is the ability to provide explanations (proofs) for the conclusions: the series of infer-
ences can be retraced. Moreover, AI researchers have developed ways of presenting
proofs in a human-friendly fashion, by organizing them as natural deductions and
by grouping, in a single element, a number of small inference steps that a person
would typically consider a single proof step.
Explanations are important for the Semantic Web because they increase
users' confidence in Semantic Web agents. Even Tim Berners-Lee speaks of an "Oh
yeah?" button that would ask for an explanation.
Of course, for logic to be useful on the Web, it must be usable in conjunction with
other data, and it must be machine processable as well. From these requirements
stem today's research efforts on representing logical knowledge and proofs in
Web languages. Initial approaches work at the XML level, but in the future rules
and proofs will need to be represented at the level of ontology languages such as
OWL.
2.2.1 Agents
Agents are software entities that work autonomously and proactively. Conceptually
they evolved out of the concepts of object-oriented programming and component-
based software development.
According to Tim Berners-Lee's article [7], a Semantic Web agent shall be
able to receive tasks and preferences from the user, seek information from Web
sources, communicate with other agents, compare information about user require-
ments and preferences, select certain choices, and give answers back to the user.
Agents will not replace human users on the Semantic Web, nor will they necessarily
make decisions. In most cases their role will be to collect and organize information,
and present choices for the users to select from.
Semantic Web agents will make use of all the outlined technologies, in particular:
• Metadata will be used to identify and extract information from Web Sources.
• Ontologies will be used to assist in Web searches, to interpret retrieved infor-
mation, and to communicate with other agents.
• Logic will be used for processing retrieved information and for drawing con-
clusions.
2.3 Semantic Web Languages
2.3.1 RDF and RDF-Schema
RDF is essentially a data model whose basic building block is an object-attribute-
value triple, called a statement. An example of a statement is: Kimba is a Lion.
This abstract data model needs a concrete syntax in order to be represented and ex-
changed, and RDF has been given an XML syntax. As a result, it inherits the
advantages of the XML language. However, it is important to notice that other
representations of RDF, not in XML syntax, are possible; N3 is an example.
RDF is, by itself, domain independent: no assumptions about a particular application
domain are made. It is up to each user to define the terminology to be used
in his/her RDF data model using a schema language called RDF-Schema (RDF-S).
RDF-Schema defines the terms that can be used in an RDF data model. In RDF-S
we can specify which objects exist and which properties can be applied to them, and
what values they can take. We can also describe the relationships between objects;
so, for example, we can write: The lion is a carnivore.
This sentence means that all lions are carnivores. Clearly there is an intended
meaning for the "is a" relation. It is not up to applications to interpret the "is a"
term; its intended meaning shall be respected by all RDF processing software. By
fixing the meaning of some elements, RDF-Schema enables developers to model
specific knowledge domains.
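As a minimal illustrative sketch (the class identifiers are hypothetical), the statement "The lion is a carnivore" could be written in RDF-S as:
<rdfs:Class rdf:ID="Lion">
<rdfs:subClassOf rdf:resource="#Carnivore"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Carnivore"/>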
The principal elements of the RDF data model are: resources, properties and
statements.
Resources are the objects we want to talk about. Resources may be authors,
cities, hotels, places, people, etc. Every resource is identified by a sort of identifier,
called a URI. URI stands for Uniform Resource Identifier and provides a means to
uniquely identify a resource, be it available on the web or not. URIs do not imply
the actual accessibility of a resource and are therefore suitable not only for web
resources but also for printed books, phone numbers, people and so on.
Properties are a special kind of resource that describes relations between the
objects of the RDF data model, for example: "written by", "eats", "lives", "title",
"color", "age", and so on. Properties are also identified by URIs. This choice allows,
on the one hand, the adoption of a global, worldwide naming scheme and, on the other
hand, the writing of statements having a property either as subject or as object. URIs also help solve
the homonym problem that has plagued distributed data representation
until now.
Statements assert the properties of resources. They are object-attribute-value
triples consisting respectively of a resource, a property and a value. Values can
either be resources or literals. Literals are atomic values (strings) that can have a
specific XSD type, xsd:double for example. A typical example of a statement is:
the H-DOSE website is hosted by www.sourceforge.net.
This statement can be rewritten in triple form:
("H-DOSE web site", "hosted by", "www.sourceforge.net")
and in RDF it can be modeled as:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:mydomain="http://www.mydomain.net/my-rdf-ns">
<rdf:Description about="http://dose.sourceforge.net">
<mydomain:hostedBy>
http://www.sourceforge.net
</mydomain:hostedBy>
</rdf:Description>
</rdf:RDF>
One of the major strengths of RDF is so-called reification: in RDF it
is possible to make statements about statements, such as:
Mike thinks that Joy has stolen his diary
This kind of statement makes it possible to model belief or trust in other statements, which
is important for some kinds of application. In addition, reification allows
non-binary relations to be modeled using triples. The key idea, since RDF only supports triples,
i.e., binary relationships, is to introduce an auxiliary object and relate it to each of
the parts of the non-binary relation through dedicated properties.
So, for example, if we want to represent the ternary relationship referee(X,Y,Z)
with the following well-defined meaning:
X is the referee in a tennis game between players Y and Z.
Figure 2.2. Representation of a tertiary predicate.
we have to break it into three binary relations, adding an auxiliary resource called
tennisGame, as in Figure 2.2 and in the sketch below.
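A minimal RDF/XML sketch of this decomposition could be the following; the resource and property names (game1, sport:referee, sport:player1, sport:player2 and the sport: namespace prefix) are illustrative only, chosen to mirror the structure of Figure 2.2:
<rdf:Description rdf:ID="game1">
<rdf:type rdf:resource="#tennisGame"/>
<sport:referee rdf:resource="#X"/>
<sport:player1 rdf:resource="#Y"/>
<sport:player2 rdf:resource="#Z"/>
</rdf:Description>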
RDF critical view
As already mentioned, RDF only uses binary properties. This restriction could be quite
limiting, since we usually adopt predicates with more than two arguments.
Fortunately, reification allows this issue to be overcome. However, some critical aspects
arise from the adoption of the reification mechanism. First, although the solution
is sound, the problem remains that non-binary predicates are more naturally expressed with
more arguments. Secondly, reification is a quite complex and powerful technique
which may appear misplaced in a basic layer of the Semantic Web; it would
have seemed more natural to include it in more powerful layers that provide richer
representational capabilities.
In addition, the XML syntax of RDF is quite verbose and can easily become too
cumbersome to be managed directly by users, especially for huge data models. Hence
the adoption of user-friendly tools that automatically translate higher-level
representations into RDF.
Finally, RDF is a standard format; therefore the benefits of drafting data in
RDF can be seen as similar to those of drafting information in HTML in the early days of
the Web.
From RDF/RDF-S to OWL
The expressiveness of RDF and RDF-Schema (described above) is very limited (and
this is a deliberate choice): RDF is roughly limited to modeling binary relationships, and
RDF-S is limited to sub-class hierarchies and property hierarchies, with restrictions
on the domain and range of the latter.
However, a number of research groups have identified characteristic use
cases for the Semantic Web that would require much more expressiveness than RDF
and RDF-S offer. Initiatives from both Europe and the United States came up with
proposals for richer languages, respectively named OIL and DAML-ONT, whose
merger, DAML+OIL, was taken by the W3C as the starting point for the Web
Ontology Language OWL.
Ontology languages must allow users to write explicit, formal conceptualizations
of domain knowledge; the main requirements are therefore:
• a well-defined syntax,
• a formal semantics,
• efficient reasoning support,
• sufficient expressive power,
• convenience of expression.
The importance of a well-defined syntax is clear, and known from the area of pro-
gramming languages: it is a necessary condition for "machine understandability"
and thus for machine processing of information. Both RDF/RDF-S and OWL have
this kind of syntax. A formal semantics allows the meaning of knowledge to be described
precisely. "Precisely" means that the semantics does not refer to subjective intuitions and
is not open to different interpretations by different people (or different machines).
The importance of a formal semantics is well known, for example, in the domain of
mathematical logic. Formal semantics is needed to allow people to reason about
knowledge. This, for ontologies, means that we may reason about:
• Class membership. If x is an instance of a class C, and C is a subclass of D,
we can infer that x is also an instance of D.
• Equivalence of classes. If a class A is equivalent to a class B, and B is equiv-
alent to C, then A is equivalent to C, too.
• Consistency. Let x be an instance of A, and suppose that A is a subclass of
B ∩ C and of D. Now suppose that B and D are disjoint. There is a clear
inconsistency in our model because A should be empty but has the instance
x. Inconsistencies like this indicate errors in the ontology definition.
• Classification. If we have declared that certain property-value pairs are suffi-
cient conditions for membership in a class A, then if an individual (instance)
x satisfies such conditions, we can conclude that x must be an instance of A.
Semantics is a prerequisite for reasoning support. Derivations such as the preced-
ing ones can be made by machines instead of by hand. Reasoning is
important because it allows one to:
• check the consistency of the ontology and of the knowledge model,
• check for unintended relationships between classes,
• automatically classify instances.
Automatic reasoning allows many more cases to be checked than could be checked man-
ually. Such checks become critical when developing large ontologies, where multiple
authors are involved, as well as when integrating and sharing ontologies from various
sources.
Formal semantics is obtained by defining an explicit mapping between an ontol-
ogy language and a known logical formalism, and by using automated reasoners that
already exist for that formalism. OWL, for instance, is (partially) mapped onto de-
scription logic, and makes use of existing reasoners such as FaCT, Pellet and RACER.
Description logics are a subset of predicate logic for which efficient reasoning support
is possible.
2.3.2 OWL languages
The full set of requirements for an ontology language is: efficient reasoning support
and convenience of expression for a language as powerful as the combination of
RDF-Schema with a full logic. These requirements have been the main motivation
for the W3C Web Ontology Working Group to split OWL into three different sublanguages, each
targeted at different aspects of the full set of requirements.
OWL Full
The entire Web Ontology Language is called OWL Full and uses all the OWL lan-
guage primitives. It also allows these primitives to be combined in
arbitrary ways with RDF and RDF-Schema. This includes the possibility (already
present in RDF) of changing the meaning of the predefined (RDF and OWL) prim-
itives by applying the language primitives to each other. For example, in OWL Full
it is possible to impose a cardinality constraint on the class of all classes, essentially
limiting the number of classes that can be described in an ontology.
The advantage of OWL Full is that it is fully upward-compatible with RDF,
both syntactically and semantically: any legal RDF document is also a legal OWL
Full document, and any valid RDF/RDF-S conclusion is also a valid OWL Full
conclusion. The disadvantage of OWL Full is that the language has become
so powerful as to be undecidable, dashing any hope of complete (or efficient)
reasoning support.
OWL DL
In order to regain computational efficiency, OWL DL (DL stands for Description
Logic) is a sublanguage of OWL Full that restricts how the constructors from
RDF and OWL may be used: applying OWL's constructors to each other is
prohibited, so as to ensure that the language corresponds to a well-studied description
logic.
The advantage of this is that it permits efficient reasoning support. The dis-
advantage is the loss of full compatibility with RDF: an RDF document will, in
general, have to be extended in some ways and restricted in others before becoming
a legal OWL DL document. Conversely, every legal OWL DL document is a legal RDF
document.
OWL Lite
An even further restriction limits OWL DL to a subset of the language constructors.
For example, OWL Lite excludes enumerated classes, disjointness statements, and
arbitrary cardinality constraints.
The advantage of this is a language that is both easier to grasp for users and easier
to implement for developers. The disadvantage is, of course, restricted expressivity.
2.3.3 OWL in a nutshell
Header
OWL documents are usually called OWL ontologies and they are RDF documents.
The root element of an ontology is an rdf:RDF element, which specifies a number
of namespaces:
<rdf:RDF
xmlns:owl = "http://www.w3.org/2002/07/owl#"
xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs = "http://www.w3.org/2000/01/rdf-schema#"
xmlns:xsd = "http://www.w3.org/2001/XMLSchema#">
An OWL ontology can start with a set of assertions for housekeeping purposes. These
assertions are grouped under an owl:Ontology element, which contains comments,
version control information, and inclusions of other ontologies.
<owl:Ontology rdf:about="">
<rdfs:comment>A simple OWL ontology</rdfs:comment>
<owl:priorVersion
rdf:resource="http://www.domain.net/ontologyold"/>
<owl:imports
rdf:resource="http://www.domain2.org/savanna"/>
<rdfs:label>Africa animals ontology</rdfs:label>
</owl:Ontology>
The most important of the above assertions is owl:imports, which lists other
ontologies whose content is assumed to be part of the current ontology. It is impor-
tant to be aware that owl:imports is a transitive property: if ontology A
imports ontology B, and ontology B imports ontology C, then A also
imports C.
Classes
Classes are defined using the owl:Class element and can be organized in hierarchies
by means of the rdfs:subClassOf construct.
<owl:Class rdf:ID="Lion">
<rdfs:subClassOf rdf:resource="#Carnivore"/>
</owl:Class>
It is also possible to indicate that two classes are completely disjoint, such as
herbivores and carnivores, using the owl:disjointWith construct.
<owl:Class rdf:about="#carnivore">
<owl:disjointWith rdf:resource="#herbivore"/>
<owl:disjointWith rdf:resource="#omnivore"/>
</owl:Class>
Equivalence of classes may be defined using the owl:equivalentClass element.
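For instance, the following minimal sketch (the class names are illustrative) declares two classes to be equivalent:
<owl:Class rdf:about="#Lion">
<owl:equivalentClass rdf:resource="#PantheraLeo"/>
</owl:Class>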
Finally, there are two predefined classes, owl:Thing and owl:Nothing, which
indicate, respectively, the most general class containing everything in an OWL doc-
ument, and the empty class. As a consequence, every owl:Class is a subclass of
owl:Thing and a superclass of owl:Nothing.
Properties
OWL defines two kinds of properties:
• Object properties, which relate objects to other objects. An example, in the
savanna ontology, is the relation eats.
• Datatype properties, which relate objects to datatype values. Examples are
age, name, and so on. OWL does not have any predefined data types, nor does it
provide special definition facilities. Instead, it allows the use of XML Schema
datatypes, making use of the layered architecture of the Semantic Web.
Here are two examples, the first for a datatype property and the second
for an object property:
<owl:DatatypeProperty rdf:ID="age">
<rdfs:range rdf:resource="&xsd;#nonNegativeInteger"/>
</owl:DatatypeProperty>
<owl:ObjectProperty rdf:ID="eats">
<rdfs:domain rdf:resource="#animal"/>
</owl:ObjectProperty>
More than one domain and range can be declared; in such a case the intersection
of the domains (ranges) is taken. OWL also allows "inverse properties" to be identified:
a specific OWL element exists for this purpose (owl:inverseOf), which relates a
property to its inverse by interchanging the domain and range definitions.
<owl:ObjectProperty rdf:ID="eatenBy">
<owl:inverseOf rdf:resource="#eats"/>
</owl:ObjectProperty>
Finally, equivalence of properties can be defined through the
owl:equivalentProperty element.
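A minimal sketch (the property feedsOn is hypothetical, eats comes from the savanna example above) could be:
<owl:ObjectProperty rdf:ID="feedsOn">
<owl:equivalentProperty rdf:resource="#eats"/>
</owl:ObjectProperty>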
Restrictions on properties
In RDF-S it is possible to declare a class C to be a subclass of a class C′; then every
instance of C will also be an instance of C′. OWL additionally allows one to specify classes C
whose instances all satisfy some precise conditions. This is done by defining C
as a subclass of a class C′ which collects all the objects that satisfy the conditions.
In general, C′ remains anonymous. In OWL there are three specific elements for
defining classes based on restrictions: owl:allValuesFrom, owl:someValuesFrom
and owl:hasValue; they are always nested inside an owl:Restriction element.
owl:allValuesFrom specifies a universal quantification (∀) while owl:someValuesFrom
defines an existential quantification (∃).
<owl:Class rdf:about="#firstYearCourse">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#isTaughtBy"/>
      <owl:allValuesFrom rdf:resource="#Professor"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
This example requires every person who teaches an instance of “firstYearCourse”,
e.g., a first year subject, to be a professor (universal quantification).
<owl:Class rdf:about="#academicStaffMember">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#teaches"/>
      <owl:someValuesFrom rdf:resource="#undergraduateCourse"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
This second example, instead, requires that there exists an undergraduate course
taught by each instance of the class of academic staff members (existential quantifi-
cation).
In general, an owl:Restriction element contains an owl:onProperty element
and one or more restriction declarations. Restrictions on the cardinality of a given
class are also supported through the elements:
• owl:minCardinality,
• owl:maxCardinality,
• owl:cardinality.
The latter is a shortcut for a cardinality definition in which owl:minCardinality
and owl:maxCardinality assume the same value, as in the example below.
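As an illustrative sketch (the class name #course is hypothetical, while isTaughtBy is taken from the earlier restriction example), a course could be constrained to be taught by exactly one person:
<owl:Class rdf:about="#course">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#isTaughtBy"/>
      <owl:cardinality rdf:datatype="&xsd;#nonNegativeInteger">1</owl:cardinality>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>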
Special properties
Some characteristics of properties can be stated directly:
• owl:TransitiveProperty defines a transitive property, such as “has better
grade than”, “is older than”, etc.
• owl:SymmetricProperty defines a symmetric property, such as “has same
grade as” or “is sibling of”.
• owl:FunctionalProperty defines a property that has at most one value for
each object, such as “age”, “height”, “directSupervisor”, etc.
• owl:InverseFunctionalProperty defines a property for which two different
objects cannot have the same value, for example “is identity ID for”.
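For example, a hedged sketch continuing the savanna examples (the property name isOlderThan is illustrative and is not defined elsewhere in this ontology):
<owl:TransitiveProperty rdf:ID="isOlderThan">
  <rdfs:domain rdf:resource="#animal"/>
  <rdfs:range rdf:resource="#animal"/>
</owl:TransitiveProperty>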
Instances
Instances of classes, in OWL, are declared as in RDF:
<rdf:Description rdf:ID="Kimba">
  <rdf:type rdf:resource="#Lion"/>
</rdf:Description>
OWL, unlike typical database systems, does not adopt a unique-names assumption;
therefore, two instances that have different names are not necessarily two different
individuals. To ensure that different individuals are recognized as such by automated
reasoners, inequality must be explicitly asserted.
<lecturer rdf:ID="91145">
  <owl:differentFrom rdf:resource="#98760"/>
</lecturer>
Because such inequality statements occur frequently, and the required number of
statements would explode when stating the inequality of a large number of individuals,
OWL provides a shorthand notation to assert the pairwise inequality of all the
individuals in a list: owl:AllDifferent.
<owl:AllDifferent>
  <owl:distinctMembers rdf:parseType="Collection">
    <lecturer rdf:about="#91345"/>
    <lecturer rdf:about="#91247"/>
    <lecturer rdf:about="#95647"/>
    <lecturer rdf:about="#98920"/>
  </owl:distinctMembers>
</owl:AllDifferent>
Note that owl:distinctMembers can only be used in combination with the
owl:AllDifferent element.
Chapter 3
Web applications for Information
Management
This chapter provides an overview of currently available web applications,
with a particular focus on systems for information management such as
Content Management Systems, indexing and retrieval systems, and e-
Learning systems. For every category of applications, the points in which
semantics can give substantial improvements, either in effectiveness (per-
formance) or in user experience, are highlighted.
Many businesses and activities, either on the web or not, are human and knowledge
intensive. Examples include consulting, advertising, media, high-tech, pharmaceuti-
cal, law, software development, etc. Knowledge-intensive organizations have already
found that a large number of problems can be attributed to un-captured and un-
shared product and process knowledge, as well as to a lack of "who knows what"
information, to the need to capture lessons learned and best practices, and to the
need for more effective distance collaboration.
These realizations are leading to a growing call for knowledge management.
Knowledge capture and learning can happen ad-hoc (e.g. knowledge sharing around
the coffee maker or problem discussions around the water cooler). Sharing, however,
is more efficient when organized; moreover, knowledge must be captured, stored and
organized according to the context of each company, or organization, in order to be
useful and efficiently disseminated.
The knowledge items that an organization usually needs to manage can have
different forms and contents. They include manuals, correspondence with vendors
and customers, news, competitor intelligence, and knowledge derived from work
processes (e.g. documentation, proposals, project plans, etc.), possibly in different
formats (text, pictures, video). The amount of information and knowledge that a
modern organization must capture, store and share, the geographic distribution of
sources and consumers, and the dynamic evolution of information make the use of
technology nearly mandatory.
3.1 The knowledge sharing and life cycle model
One of the most widely adopted knowledge sharing models was developed by Nonaka
and Takeuchi in 1995 [8] and is called the "tacit-explicit model" (see Figure 3.1).
Figure 3.1. The Tacit-Explicit Model.
According to this model, tacit knowledge is knowledge that rests with the employee,
while explicit knowledge is knowledge that resides in the knowledge base. Conversion
of knowledge from one form to another often leads to the creation of new knowledge;
such conversion may follow four different patterns:
• Explicit-to-explicit knowledge conversion, or “Combination”, is the reconfigu-
ration of explicit knowledge through sorting, adding, combining and catego-
rizing. Often this process leads to new knowledge discovery.
• Explicit-to-tacit knowledge conversion, or “Internalization”, takes place when
one assimilates knowledge acquired from knowledge items. This internalization
contributes to the user’s tacit knowledge and helps him/her in making future
decisions.
• Tacit-to-explicit knowledge conversion, or “Externalization”, involves trans-
forming context-based facts into context-free knowledge, with the help of
analogies. Tacit knowledge is usually personal and depends on the person’s
experiences in various conditions. As a consequence, it is a strongly contextu-
alized resource. Once explicit, it will not retain much value unless context in-
formation is somehow preserved. Externalization can take two forms: recorded
or unrecorded knowledge.
• Tacit-to-tacit knowledge conversion, or "Socialization", occurs by sharing ex-
periences, by working together in a team and, more generally, by direct
exchange of knowledge. Knowledge exchange at places where people socialize,
such as around the coffeemaker or the water cooler, leads to tacit-to-tacit
conversion.
The "knowledge life cycle" follows the path of knowledge creation/acquisition,
knowledge organization and storage, knowledge distribution, and application
and reuse, and finally ends up in creation/acquisition again, in a sort of spiral-
shaped process (Figure 3.1).
Tacit knowledge has to be made explicit in order to be captured and made
available to all the actors of a given organization. This is accomplished with the aid
of knowledge acquisition or knowledge creation tools. Knowledge acquisition builds
and evolves the knowledge bases of organizations. Knowledge organization/storage
takes place through activities by which knowledge is organized, classified and stored
in repositories. Explicit knowledge needs to be organized and indexed for easy
browsing and retrieving. It must be stored efficiently to minimize the required
storage space.
Knowledge can be distributed through various channels such as training pro-
grams, automatic knowledge distribution systems and knowledge-based expert sys-
tems. Regardless of the way knowledge is distributed, making the knowledge
base of a given organization available to its users, i.e., delivering the right infor-
mation at the right place and time, is one of the most critical assets for a modern
company.
3.2 Software tools for knowledge management
Knowledge management (KM) shall be supported by a collection of technologies for
authoring, indexing, classifying, storing, contextualizing and retrieving information,
as well as for collaboration and application of knowledge. A friendly front-end and
a robust back-end are the basis for KM software tools. Involved elements include,
among others, content management systems for efficiently publishing and sharing
knowledge and data sources, indexing, classification and retrieval systems to ease
access to information stored in knowledge bases, and e-Learning systems for allowing users
to perform “Internalization”, possibly in a personalized way.
These technologies are still far from being completely semantic, i.e., based on
context- and domain-aware elements able to fully support knowledge operations at
a higher level, similar to what humans usually do, while at the same time remaining
accessible to machines.
The following sub-sections take as a reference three widely adopted technologies,
namely content management systems (CMS), search engines and e-Learning systems,
and for each of them discuss the basic principles and "how and where" semantics
can improve their performance. Performance is evaluated both from the functional
point of view and from the knowledge transfer point of view.
For each technology, a separate subsection provides a brief introduction, a
discussion of the currently available solutions and some considerations about the
integration of semantic functionalities. Finally, the shortcomings and solutions
identified will define the requirements for a general-purpose platform able to provide
semantics integration on the web with minimal effort.
3.2.1 Content Management Systems (CMS)
In terms of knowledge management, the documents that an organization produces,
and publishes, represent its explicit knowledge. New knowledge can be created by
efficiently managing document production and classification: for example, de-facto
experts can be identified based on the authorship of documents. Document and, more
generally, content management systems (CMS) enable explicit-to-explicit knowledge
conversion.
A CMS is, in simple terms, a software system designed to manage a complete web site
and the related information. It keeps track of changes, by recording who changed
what, and when, and can allow notes to be added to each managed resource.
A writer can create a new page (with standard navigation bars and images on each
page) and submit it, without using HTML-specific software. In the same way, an
editor can receive all the proposed changes and approve them, or he/she can send
them back to the writer to be corrected. Many other functionalities are supported,
such as sending e-mails, contacting other users of the CMS, setting up forums or
chats, etc. In a sense, a CMS is the central point for "presentation independent"
exchange of information on the web and/or on organizations' intranets.
The separation between presentation and creation of published resources is, in-
deed, one of the added values of CMS adoption. A CMS allows editors to concentrate
their efforts on content production while the graphical presentation issues are car-
ried out by the system itself, so that the resulting web site is both coherent and
homogeneously organized. The same separation makes it possible to better organize
the web site navigation and to define the cognitive path that users are expected to follow
when browsing the available pages.
CMS common features
Almost all CMSs help organizations to achieve the following goals:
• Streamline and automate content administration. Historically, Web content
has consisted of static pages/files of HTML, requiring HTML programming ex-
perience and manual updating of content and design: clearly a time-consuming
and labor-intensive process. In contrast, CMSs significantly reduce this over-
head by hiding the complexities of HTML and by automating the management
of content.
• Implement Web-forms-based content administration. In an ideal CMS, all con-
tent administration is performed through Web forms using a Web browser.
Proprietary software and specialized expertise (such as HTML) are not re-
quired for content managers. Users simply copy and paste existing content or
fill in the blanks on a form.
• Distribute content management and control. The Web manager has often been
a critical bottleneck in the timely publication and ongoing maintenance of Web
content. CMSs remove that bottleneck by distributing content management
responsibilities to individuals throughout the organization. Those individuals
who are responsible for content now have the authority and tools to maintain
the content themselves, without any knowledge of HTML, graphic design, or
Web publishing.
• Separate content from layout and design. In a CMS, content is stored sepa-
rately from its publication format. Content managers enter the content only
once, but it can appear in many different places, formatted using very differ-
ent layouts and graphic designs. All the pages immediately reflect approved
content changes.
• Create reusable content repositories. CMSs allow for reuse of content. Objects
such as templates, graphics, images, and content are created and entered once
and then reused as needed throughout the Web site.
• Implement central graphic design management. Graphic design in a CMS
becomes template-driven and centrally managed. Templates are the structures
that format and display content following a request from a user for a particular
Web page. Templates ensure a consistent, professional look and feel for all
content on the site. They also allow for (relatively) easy and simultaneous
modification of an entire site graphic design.
• Automate workflow management. Good CMSs enable good workflow pro-
cesses. In the most complex workflow systems, at least three different in-
dividuals create, approve, and publish a piece of content, working separately
and independently. A good workflow system expedites the timely publica-
tion of content by alerting the next person in the chain when an action is
required. It also ensures that content is adequately reviewed and approved
before publication.
• Build sophisticated content access and security. Good CMSs allow for sophis-
ticated control of content access, both for content managers who create and
maintain content and for users who view and use it. Web managers should be
able to define who has access to different types of information and what type
of access each person has.
• Make content administration database-driven. In a CMS, static, flat HTML
pages no longer exist. Instead, the system places most content in a rela-
tional database capable of storing a variety of text and binary materials. The
database then becomes the central repository for contents, templates, graphics,
users and metadata.
• Include structures to collect and store metadata. Because data is stored sepa-
rately from both layout and design, the database also stores metadata describ-
ing and defining the data, usually including author, creation date, publication
and expiration dates, content descriptions and indexing informations, cate-
gories information, revision history, security and access information, and a
range of other content-related data.
• Allow for customization and integration with legacy systems. CMSs allow cus-
tomization of the site functionality through programming. They can expose
their functionalities through an application programming interface (API) and
they can coexist with, and integrate, already deployed legacy systems.
• Allow for archiving and version control. High-end CMSs usually provide mech-
anisms for storing and managing revisions to content. As changes are made,
the system stores archives of the content and allows reversion of any page to
an earlier version. The system also provides means for pruning the archived
content periodically, preferably on the basis of criteria including age, location,
number of versions, etc.
Logical architecture
A typical CMS is organized as in Figure 3.2.
Figure 3.2. A typical CMS architecture.
It is composed of five macro-components: the Editing front-end, the Site or-
ganization module, the Review system, the Theme management module and the
Publication system.
The Editing front-end is the starting point of the publishing chain implemented
by the CMS. This component usually allows "journalists", i.e., persons who produce
new content, to submit their writings. After submission, the new content is adapted
for efficient storage and, if a classification module exists, it is indexed (either
automatically or manually) in order to later allow effective retrieval of the stored
resources.
The Review system constitutes the other side of the submission process, and it is
designed to support the work flow of successive reviews that occur between the first
submission of a document and its final publication. Users allowed to interact with
this component are usually more expert users, who have the authority to review
the contents submitted by journalists. They can approve the submitted data or
they can send it back to the journalists for further modifications. The approval
process is deployed differently depending both on the CMS class (high-end, middle-
level or entry-level) and on the redaction paradigm adopted: either based on a single
review or on multiple reviews.
The Site organization module provides the interface for organizing the published
information into a coherent and possibly usable web site. This module is specifically
targeted at the definition of the navigation patterns to be proposed to the final users,
i.e., at the definition of the site map. Depending on the class of the CMS, the
site organization is either designed by journalists or it is proposed by journalists and
subsequently reviewed by editors, possibly in a complete iterative cycle. Since many
conflicting site maps may arise, even with few journalists, today's content
management systems usually adopt the latter publication paradigm.
The Theme management component is in charge of the complete definition of the
site appearance. Graphic designers interact with this module by loading pictures
and graphical decorations, by defining publication styles, by creating dynamic and
interactive elements (menus, for example), etc. The graphical aspect of a site is a
critical asset both for the site success (for sites published on the Web) and for the site
effectiveness in terms of user experience (usability and accessibility issues); therefore,
care must be paid when developing themes and decorations. Usually the theme creation
is not moderated by the CMS. In addition, many of the currently available systems
do not allow on-line editing of graphical presentations. Instead, each designer
develops his/her own theme (or part of a theme) and uploads it to the
CMS as a whole element. The final publication is subject to the editor's approval;
however, there are usually no means, for an editor, to set up an iterative refinement
cycle similar to the document publication process.
The Publication system is both the most customizable and the most visible com-
ponent of a CMS. Its main duty is to pick up and publish resources from the CMS
document base. Only approved documents can be published, while the resources un-
der review should remain invisible to the users (except to journalists and editors).
The publishing system adopts the site map defined by means of the organization
module, and stored in the CMS database, to organize the information to be pub-
lished. Links between resources can either be defined at publication time or can be
automatically generated at runtime according to some (complex) rules. The graphical
presentation of pages depends on the theme defined by graphic designers and ap-
proved by the editors. Depending on company needs and mission, pages can turn out
to be completely accessible, even to people with disabilities (this is mandatory for web
sites providing services to people, and is desirable for every site on the Web), or
completely inaccessible.
In addition to the publication of documents edited by the journalists, the sys-
tem can offer many more services to the final viewers, depending on the sup-
ported/installed sub-modules. So, for example, a typical CMS is able to offer mul-
tiple topic-centered forums, mailing lists, instant messaging, white boards between
on-line users, cooperation systems such as Wikis or Blogs, etc.
Semantics integration
As shown in the aforementioned paragraphs, CMSs are, at the same time, effec-
tive and critical components for knowledge exploitation, especially for explicit to
explicit conversion (Combination) and for explicit-to-tacit conversion (Internaliza-
tion). They often offer some kind of metadata-management functions, allowing them
to keep track of the authors of published data, of the creation, publication and
expiration dates of documents, and of information for indexing and categorizing the
document base. This information is only semantically consistent with the internal
CMS database, i.e., it roughly corresponds to the fields of the CMS database. As
shown by many years of research in database systems, this is actually a form of
semantics; however, it is neither related to external resources nor to explicitly
available models. Stored meta-information, although meaningful inside CMS-related
applications, will thus not be understandable by external applications, making the
whole system less interoperable.
A Semantic Web system, as well as a future Wisdom Web system, instead relates
its internal knowledge to well-known models where possible, such as the Dublin Core
for document authorship. Even when sufficiently detailed models are not available
and must be developed from scratch, the way metadata is formalized follows a
well-defined standard and is understood by all "semantic-aware" software and
architectures. Given that every Semantic Web application shall manage at least
RDF/S and the related semantics, interoperability is automatically granted.
Indexing and searching the CMS base can also take advantage of semantic infor-
mation. As for metadata (which is in turn strongly related to indexing and retrieval),
current systems already provide several advanced facilities for allowing users to store
and retrieve their data in an effective way. Unfortunately, the adopted technologies
are essentially keyword-based.
A keyword-based information retrieval sub-system relies on the occurrence of
specific words (called keywords) inside the indexed documents. The most general
words, such as articles, conjunctions, and the like, are usually discarded as they are
not useful for distinguishing documents, while the remaining terms are collected
and inversely related to each resource in the CMS base. In the end, for each
word a list of the documents in which the word occurs is compiled.
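The following minimal sketch (Python; the stop-word list and the tokenizer are simplifying assumptions, not part of any specific CMS) illustrates how such an inverted index can be built and queried:

# Minimal inverted index sketch: maps each keyword to the set of
# document identifiers in which it occurs.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in"}  # toy list

def tokenize(text):
    # Lowercase, strip punctuation, drop stop words.
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return [w for w in words if w and w not in STOP_WORDS]

def build_index(documents):
    # documents: dict mapping doc_id -> raw text
    index = {}
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index.setdefault(word, set()).add(doc_id)
    return index

def search(index, query):
    # Return the documents containing every query keyword.
    results = None
    for word in tokenize(query):
        docs = index.get(word, set())
        results = docs if results is None else results & docs
    return results or set()

docs = {"d1": "The lion eats the gazelle", "d2": "The gazelle eats grass"}
idx = build_index(docs)
print(search(idx, "gazelle eats"))  # -> {'d1', 'd2'}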
Whenever a user performs a query, the query terms are matched against the term
list stored by the retrieval subsystem, and the corresponding resources are retrieved
according to properly defined ranking mechanisms. Regardless of the accuracy and
efficiency of the different systems available today, they all share a common and still
problematic issue: they are vocabulary dependent. Being based on keywords found
in the managed document base, these engines are not able to retrieve anything if the
same keywords do not occur both in the user query and in the stored documents.
A more extensive discussion of these topics is available in the Information
Retrieval systems sub-section; however, it is easy to notice that semantics integration
can alleviate, if not completely address, these issues by abstracting both queries and
document descriptions to "semantic models". The matching operation is, in this
case, performed at the level of conceptual models, and if the conversion between
either the query or the document content and the corresponding conceptual model
has been performed effectively, matches can be found independently of vocabularies.
This capability of finding resources independently of vocabularies can be leveraged
by "Semantic Web" CMSs to offer far more advanced functionalities than are
available today. Language independence, for example, can easily be achieved: a user
can write a query in a given language, which will likely be his/her mother tongue,
and can ask the system to retrieve data both in the query language and in other
languages that he/she can understand. Matching queries and documents at the
conceptual level makes this process fairly easy. The currently available systems,
instead, can usually provide results only in the language of the query (in the
rather fortunate case in which the query language corresponds to one of the
languages adopted for keyword-based indexing).
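As a purely illustrative sketch (Python; the term-to-concept dictionary is invented and is not taken from any real ontology or CMS), cross-language matching at the conceptual level could look like:

# Toy concept-level matching: terms in different languages map to the
# same concept identifiers, so queries and documents can be compared
# independently of the vocabulary they use.
TERM_TO_CONCEPT = {
    "lion": "concept:Lion", "leone": "concept:Lion",      # en / it
    "horse": "concept:Horse", "cavallo": "concept:Horse",
}

def to_concepts(words):
    # Map each known term to its concept; unknown terms are ignored.
    return {TERM_TO_CONCEPT[w] for w in words if w in TERM_TO_CONCEPT}

def conceptual_match(query_words, document_words):
    # A document matches when it shares at least one concept with the query.
    return bool(to_concepts(query_words) & to_concepts(document_words))

print(conceptual_match(["leone"], ["lion", "savanna"]))  # True, despite different languages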
Other retrieval-related functionalities include the contextual retrieval of "seman-
tically related" pages during user navigation. When a user requests a given page of
a site managed by a semantic CMS, the page's conceptual description is picked up
and used to retrieve links to pages of the site that have conceptual descriptions
similar to that of the requested page. This allows, for example, browsing the pub-
lished site by similarity of pages rather than by following the predefined interaction
scenario fixed by the site map.
The impact of semantics on CMS technology is not limited to storage, classi-
fication and retrieval of resources. Semantics can also be extremely useful in defining
the organization of published sites and the expected work flow for resources
produced and reviewed within the CMS environment. There are several
attempts to provide the first CMS implementations in which content is automat-
ically organized and published according to a given ontology. In the same way,
ontologies are used for defining the complex interactions that characterize the docu-
ment review process and the steps required for a document to be published
by the CMS.
In conclusion, the introduction of semantics handling in content management
systems provides several advantages, both in terms of document storage and
retrieval and in terms of site navigation and the site publication work flow.
3.2.2 Information Retrieval systems
Information Retrieval has been one of the most active research streams during the
past decade. It still permeates almost all web-related applications providing either
means, methodologies or techniques for easily accessing resources, be they human
understandable resources, database records or whatever. Information retrieval deals
with the problem of storing, classifying and effectively retrieving resources, i.e.,
information, in computer systems. The find utility in Unix or the small Microsoft’s
dog are very simple examples of information retrieval systems. More qualified and
probably more widespread examples are also available, Google [9] above all.
A simple information retrieval system works on the concepts of document in-
dexing, classification and retrieval. These three processes are at the basis of every
computer-based search system. For each of the three, several techniques have been
studied, from the early heuristic-based approaches up to today's statistical and
probabilistic methods. The logical architecture of a typical informa-
tion retrieval system is shown in Figure 3.3.
Figure 3.3. The Logical Architecture of a typical Information Retrieval system.
Several blocks can be identified. The Text Operations block performs all the operations
required for adapting the text of documents to the indexing process; as an example,
in this block stop words are removed and the remaining words are stemmed. The
Indexing block basically constructs an inverted index of word-to-document pointers.
The Searching block retrieves from the inverted index all the documents that contain
a given query token. The Ranking block, instead, ranks all the retrieved docu-
ments according to a similarity measure which evaluates how similar documents are
to queries. The User Interface allows users to perform queries and to view
results. Sometimes it also supports relevance feedback, which allows users to
improve the search performance of the IR system by explicitly stating which re-
sults are relevant and which are not. Finally, the Query Operations block transforms
the user query to improve the IR system performance: for example, a standard
thesaurus can be used for expanding the user query by adding new relevant terms,
or the query can be transformed by taking into account users' suggestions coming
from relevance feedback.
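A minimal sketch of the Text Operations block (Python; the stop-word list and the naive suffix-stripping "stemmer" are illustrative simplifications, not the algorithms used by any specific engine):

# Toy text-operations pipeline: normalization, stop-word removal and
# a naive suffix-stripping "stemmer".
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "is"}

def stem(word):
    # Extremely naive stemming: strip a few common English suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def text_operations(text):
    # Lowercase, tokenize, drop stop words, stem the rest.
    tokens = [w.strip(".,;:!?").lower() for w in text.split()]
    return [stem(w) for w in tokens if w and w not in STOP_WORDS]

print(text_operations("The lions are eating the gazelles"))
# -> ['lion', 'are', 'eat', 'gazell']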
Describing in detail a significant part of all the available approaches to information
retrieval and their variants would require more than one thesis alone. In addition,
the scope of this section is not to be exhaustive with respect to the available technologies,
solutions, etc.; the main goal, instead, is to provide a rough description of how an
information retrieval system works and a glimpse of the advantages that semantics
adoption can bring to Information Retrieval. For readers more interested in this
topic, the bibliography section reports several interesting works that can constitute
a good starting point for investigation. Of course, the web is the most viable means
for gathering other resources.
For the sake of simplicity, this section proceeds by adopting the tf·idf weighting
scheme and the vector space model as guiding methodology, and tries to generalize
the provided considerations whenever possible.
Indexing
In the indexing process each searchable resource is analyzed for extracting a suitable
description. This description will be, in turn, used by the classification process and
by the retrieval process. For now we restrict the description of indexing to text-
based documents, i.e., documents which mainly contain human-understandable
terms. In this case, indexing intuitively means taking into account, in some way,
the information conveyed by the words contained in the document to be indexed.
As humans can understand textual documents, information is indeed contained in
them, in a somewhat encoded form. The indexing goal is to extract this information
and to store it in a machine-processable form. In performing this extraction two
main approaches are usually adopted: the first tries to mimic what humans
do and leads to the wide and complex study of Natural Language Processing. The
second, instead, uses information which is much easier for machines to handle,
such as statistical correlations between occurring words, term frequency and so on.
This last solution is the one adopted by today's retrieval systems, while
the former finds application only in more restricted search fields where specific
and rather well-defined sub-languages can be found. The tf·idf indexing scheme is
a typical example of “machine level” resource indexing.
The basic assumption of tf·idf, and of other more sophisticated methods, is that
the information in textual resources is encoded in the adopted terms: the more specific
a term is, the more easily the subject of a document can be inferred. The main
indexing operations therefore deal with the words occurring in the resources being
analyzed, trying to extract only the relevant information and to discard all the
redundancies typical of written language.
In the tf·idf case, the approach works by inspecting the document terms. First,
all the words that usually convey little or no information, such as conjunctions,
articles, adverbs, etc., are removed. These are the so-called stop words and typically
depend on the language in which the given document is written. Removing the
stop words allows frequency-based methods to be adopted without the data being polluted
by non-significant information uniformly occurring in all the documents.
Once the documents have been purged of stop words, the tf·idf method evaluates the
frequency of each term occurring in the document, i.e., the number of times the
word occurs inside the document, normalized with respect to the most frequent term.
In the simplest implementation of tf·idf, a vocabulary L defines the words for which
this operation has to be performed.
Let $t_i$ be the $i$-th term of the vocabulary $L$. The term frequency $tf_i$ of the
term $t_i$ in the document $d$ is defined as
\[ tf_i(d) = \frac{n_i(d)}{\max_{t_j \in d} n_j(d)} \]
where $n_i(d)$ denotes the number of occurrences of $t_i$ in $d$.
The term frequency alone is clearly too simplistic a feature for characterizing a
textual resource. Term frequency, in fact, is only a relative measure of how important
(statistically speaking) a word is within a document; no information is provided
about the ability of the given word to discriminate the analyzed document from the
others. Therefore a weighting scheme shall be adopted which takes into account the
frequency with which the same term occurs in the document base. This weighting
scheme is materialized by the inverse document frequency term idf. The inverse
document frequency takes into account the relative frequency of the term $t_i$
with respect to the documents already indexed. Formally:
\[ idf_{t_i} = \log\left( \frac{|D|}{|\{ d_k \in D : tf_i(d_k) > 0 \}|} \right) \]
where $D$ is the set of indexed documents and the denominator counts the documents
in which $t_i$ occurs.
The two values, i.e., $tf$ and $idf$, are combined into a single weight, called $tf \cdot idf$,
which describes the ability of the term $t_i$ to discriminate the document $d$ from the
others:
\[ tf \cdot idf_{t_i}(d) = tf_i(d) \cdot idf_{t_i} \]
The set of $tf \cdot idf$ values, computed for each term of the vocabulary $L$, defines the
representation of a document $d$ inside the Information Retrieval system.
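A minimal sketch of this weighting scheme (Python; the tokenization is assumed to have already been performed, and the vocabulary is derived from the documents themselves, which is only one possible choice):

import math

def term_counts(words):
    # Raw occurrence count of each term in a document.
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

def tf(term, counts):
    # Term frequency normalized by the most frequent term in the document.
    return counts.get(term, 0) / max(counts.values())

def idf(term, all_counts):
    # Log of (number of documents / number of documents containing the term).
    containing = sum(1 for c in all_counts if term in c)
    return math.log(len(all_counts) / containing) if containing else 0.0

def tf_idf_vector(counts, all_counts, vocabulary):
    # tf-idf representation of one document over the whole vocabulary L.
    return {t: tf(t, counts) * idf(t, all_counts) for t in vocabulary}

docs = [["lion", "eats", "gazelle", "lion"], ["gazelle", "eats", "grass"]]
all_counts = [term_counts(d) for d in docs]
vocabulary = sorted({t for d in docs for t in d})
print(tf_idf_vector(all_counts[0], all_counts, vocabulary))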
It must be noted that this indexing process is strongly vocabulary dependent:
words not occurring in L are not recognized, and if they are used in queries they do
not lead to results. The same holds for more complex methods where L is built from
the indexed documents or from a training set of documents: words not occurring
in the set of analyzed documents are not taken into account. So, for example, if in
a set of textual resources only the word horse occurs, a query for stallion will not
provide results, even if horse and stallion can be used as synonyms.
NLP is expected to solve these problems, but its adoption in information
retrieval systems still appears immature.
Classification and Retrieval
In this section classification and retrieval are described in parallel. Although
they are quite different processes, the latter can be seen as a particular case of the
former where the category definition is given at runtime, by the user query. It shall
be stressed that this section, like the preceding one, does not aim at being complete
and exhaustive; instead, it aims at making clear some shortcomings of the retrieval
process which are shared with classification and which can be improved by the
adoption of semantics. As in the previous subsection, the tf·idf method and
the Vector Space model [10] are adopted as reference implementations.
After the indexing process, each document in the knowledge base managed by
an IR system has an associated set of features describing its content. In classification,
these features are compared to a class definition (either predefined or learned through
clustering) to evaluate whether or not the document belongs to the class. In retrieval,
instead, the same set is compared against a set of features specified by a user in
the form of a query.
The retrieval (classification) process defines how this comparison shall be per-
formed. In doing so, a similarity measure shall be defined, allowing the distance
between document descriptions and user queries or category definitions to be
measured quantitatively.
The similarity Sim(d_i,d_j) defines the distance, in terms of features, between
resources in a given representation space. Such a measure is usually normalized:
resources having the same description in terms of modeled features get a similarity
score of 1, while completely dissimilar resources receive a similarity score of 0. Please
note that a similarity measure of 1 does not mean that the compared resources are
exactly equal to each other. The similarity measure, in fact, works only on the
resources' features, and two resources can have the same features without being
equal. However, the underlying assumption is that, although different, resources
with similar features are "about" the same theme, from a human point of view.
Therefore, the higher the similarity between two resources, the higher the probability
that they have something in common.
The Vector Space model is one of the most widely used retrieval and classification
models. It works on a vectorial space defined by the document features extracted
during the indexing process. In the Vector Space model, the words belonging to the
vocabulary L are considered as the basis of the vectorial space of documents d and
queries q. Documents and queries are in fact expressed in terms of words t_i ∈ L and
can therefore be represented in the same space (Figure 3.4). Representing documents
and queries (or class definitions) in the same vectorial space allows their similarity
to be evaluated in a quite straightforward manner, since the classical cosine similarity
measure can be adopted. In the Vector Space model the similarity is, in other words,
evaluated as the cosine of the hyper-angle between the
vector of features representing a given document and the vector representing a given
query. Similarity between documents and classes can be evaluated in the same way.
Figure 3.4. The Vector Space Model.
Formally, the cosine similarity is defined as:
\[ Sim(d_i, d_j) = \frac{\vec{d_i} \cdot \vec{d_j}}{\| \vec{d_i} \| \, \| \vec{d_j} \|} \]
where $\vec{d_i}$ is the feature vector of $d_i$ and $\vec{d_j}$ is the feature vector of $d_j$. As demonstrated
by the successful application of this model to many real-world case studies, the
Vector Space is quite an effective solution to the problem of classifying and retrieving
resources. However, at least two shortcomings can be identified. First, the model
works under the assumption that the terms in L form an orthogonal basis for the
space of documents and queries.
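A minimal sketch of the cosine similarity over such feature vectors (Python; vectors are assumed to be plain lists of tf·idf weights indexed by the vocabulary L):

import math

def cosine_similarity(d_i, d_j):
    # Cosine of the angle between two feature vectors of equal length.
    dot = sum(a * b for a, b in zip(d_i, d_j))
    norm_i = math.sqrt(sum(a * a for a in d_i))
    norm_j = math.sqrt(sum(b * b for b in d_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)

# Two documents described over a three-word vocabulary.
print(cosine_similarity([1.0, 0.0, 2.0], [2.0, 0.0, 4.0]))  # ~1.0: parallel feature vectors
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0: no feature in common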
This orthogonality assumption is clearly not true, since words usually appear in groups,
depending on the document (query) type and domain. Secondly, the approach is
strongly influenced by the features extracted from documents, and since these are in
most cases simple, vocabulary-dependent, syntactic features, it also becomes syntactic
and vocabulary dependent. As an example, suppose that a wrong vocabulary L_w
contains the two words horse and stallion and suppose that they are not identified
as synonyms (although, in the analyzed knowledge domain, they actually are). If one
document consists of the single term horse and another document of the single
term stallion, they are recognized as completely different. In case a user specifies
  • 8. 1 – Introduction services to ease the citizen’s life. Users build up communities to exchange any kind of information and/or to form more powerful market actors able to survive in this global ecosystem. The stunning success is also the curse of the current web: most of todays’s Web content is only suitable for human consumption, but the huge number of available information makes it increasingly difficult for users to find and access required infor- mation. Under these conditions, keyword-based search engines, such as AltaVista, Yahoo, and Google, are the main tools for using today’s Web. However, there are serious problems associated with their use: • High recall, low precision: relevant pages are buried into other thousands of low interesting pages. • Low recall: although rarer, sometimes happens that queries get no answers because they are formulated with the wrong words. • Results are sensitive to vocabularies, so, for example, the adoption of different synonymous of the same keyword may lead to different results. • Searches are for documents and not for information. Even if the search process is successful, the result is only a “relevant” set of web pages that the user shall scan for finding the required information. In a sense, the term used to classify the involved technologies, Information Retrieval, is in this case rather misleading and Location Retrieval might be better. The critical point is that, at now, machines are not able, without heuristics and tricks, to understand documents published on the web and to extract only the relevant information from pages. Of course there are tools that can retrieve text, split phrases, count words, etc. But when it comes to interpreting and extracting useful data for the users, the capabilities of current software are still limited. One solution to this problem consists in keeping information as it currently is and in developing sophisticated tools that use artificial intelligence techniques and computational linguistics to “understand” what is written in web pages. This approach has been pursued for a while, but, at now, still appears too ambitious. Another approach is to define the web in a more machine understandable fashion and to use intelligent techniques to take advantage of this representation. This plan of revolutionizing the web is usually referred to as Semantic Web initiative and is only a single aspect of the next evolution of the web, the Wisdom Web. It is important to notice that the Semantic Web aims not at being parallel to the World Wide Web, instead it aims at evolve the Web into a new knowledge centric, global network. Such a new network will be populated by intelligent web agents able 2
  • 9. 1.2 – Domain to act on behalf of their human counterparts, taking into account the semantics of information (meaning). Users will be once more the center of the Web but they will be able to communicate and to use information with a more human-like interaction and they will also be provided with ubiquitous access to such information. 1.2 Domain: the long road from today’s Web to the Wisdom Web Starting from the current web, the ongoing evolution aims at transforming the to- day’s syntactic World Wide Web in the future Wisdom Web. The fundamental capabilities of this new network include: • Autonomic Web support, allowing to design self-regulating systems able to cooperate in autonomy with other available applications/information sources. • Problem Solving capabilities, for specifying, identifying and solving roles, settings and relationships between services. • Semantics, ensuring the “right” understanding of involved concepts and the right context for service interactions. • Meta-Knowledge, for defining and/or addressing the spatial and temporal constraints or conflicts in planning and executing services. • Planning, for enabling services or agents to autonomously reach their goals and subgoals. • Personalization, for understanding recent encounters and to relate different episodes together. • A sense of humor, so services on the Wisdom Web will be able to interact with users on a personal level. These capabilities stem from several, active, research fields including the Multi- Agent community, the Semantic Web community, the Ubiquitous Computing com- munity and have impact, or use, technologies developed for databases, computa- tional grids, social networks, etc. The Widsom Web, in a sense, is the place where most of the currently available technologies and their evolutions will join in a single scenario with a “devastating” impact on the human society. Many steps, however, still separate us from this future web, the Semantic Web initiative being the first serious and world-wide attempt to build the necessary in- frastructure. 3
  • 10. 1 – Introduction Semantics is one of the angular stones of the Wisdom Web. It has its own foundation on the formal definition of “concepts” involved by web pages and by services available on the web. Such a formal definition needs two critical compo- nents: knowledge models and associations between knowledge and resources. The formers are known as Ontologies while the latter are usually referred to as Semantic Annotations. There are several issues related to the introduction of semantics on the web: how should knowledge be modeled? how to associate knowledge to real world entities or to real world information? The Semantic Web initiative builds up on the evidence that creating a common, monolithic, omni-comprehensive knowledge model is infeasible and, instead, assumes that each actor playing a role on the web shall be able to define its own model according to its own view of the world. In the SW vision, the global knowledge model is the result of a shared effort in building and linking the single models developed all around the world, in a way like what happened for the current Web. Of course, in such a process conflicts will arise that will be solved by proper tools for mapping “conceptually similar” entities in the different models. The definition of knowledge models alone, does not introduces semantics on the web; in order to get such a result another step shall be done: resources must be “associated” to the models. The current trend is to perform this association by means of Semantic Annotations. A Semantic Annotation is basically a link between a resource on the web, either a web page or a video or a music piece and one or more concepts defined by an ontology. So, for example, the pages of a news site can be associated to the concept news in a given ontology by means of an annotation, in form of a triple: about(site, news). Semantic annotations are not only for documents or web pages but can be used to associate semantics to nearly all kinds of informative, physical and meta-physical entities in the world. Many issues are involved in the annotation process. So, just for mentioning some of them, at which granularity these annotations shall be defined? How can we define if a given annotation is trusted or not? Who shall annotate resources? Shall be the creator or anyone on the web? To answer these and the previous questions standard languages and practices for semantic markup are needed, together with a formal logic for reasoning about the available knowledge and for turning implicit information into explicit facts. At contour, databases engines, for storing semantics rich data, and search engines for offering new question-answering interfaces, constitute the informatics backbone of the new information highway so defined. Other pillars of the forthcoming Wisdom Web, which are more “technical”, in- clude autonomic systems, planning and problem solving, etc. For them, great im- provements are currently being provided by the ever active community of Intelligent 4
  • 11. 1.3 – Contribution and Multi-Agent systems. In this case the problematics involved are slightly differ- ent from the ones cited above although semantics integration can be interesting in many aspects. The main research stream is in fact about designing machines able to think and act rationally. In simple terms, the main concern, in this field, is to define systems able to autonomously pursue goals and subgoals defined either by humans or by other systems. Meta-knowledge plays a crucial role in such a process, providing means for mod- eling spatial and temporal constraints, or conflicts, that may arise between agent’s goals and can be, in turn, strongly based on semantics. Knowledge modeling can, in fact, support the definition and discovery of similarities and relationships between constraints, in a way that is independent from the dialects with which each single agent composes its world understanding. Personalization, in the end, is not a new discipline and is historically concerned with interaction modalities between users and machines, defining methodologies and instruments to design usable interfaces, for every people, be they “normal” users or diversely able people. It has impact on, or interacts with, many research fields starting from Human Computer Interaction and encompassing Statistical Analysis of user preferences and Prediction Systems. Personalization is a key factor for the Wisdom Web exploitation: until usable and efficient interfaces will not be available, in fact, this new web, in which available information will be many order of magnitude wider that the current one, will not be adopted. The good news is that all the cited issues can be solved without requiring a revolutionary scientific progress. We can in fact reasonably claim that the challenge is only in engineering and technological adoption, as partial solutions to all the relevant parts of the scenario already exist. At the present, the greatest needs appear to be in the area of integration, standardization, development of tools and adoption by the users. 1.3 Contribution In this Thesis, methodologies and techniques for paving the way that starts from nowadays web applications and leads to the Wisdom Web have been studied, with a particular focus on information retrieval systems, content management systems and e-Learning systems. A new platform for supporting the easy integration of semantic information into nowadays systems has been designed and developed, and has been applied to several case studies: a custom-made CMS, a publicly available e-Learning system (Moodle [1]), an intelligent proxy for web navigation (Muffin [2]) and a life-long learning system developed in the context of the CABLE project [3] (a EU-funded Minerva Project). In addition some extensions of the proposed system to environments sharing 5
  • 12. 1 – Introduction with the Web the underlying infrastructure and the communication and interaction paradigms have been studied. A case study is provided for domotic systems. Several contributions to the state of art in semantic systems can be found in the components of the platform including: an extension of the T.R. Gruber ontology definition, which allows to transparently support multilingual knowledge domains, a new annotation “expansion” system that allows to leverage the information encoded into ontologies for extending semantic annotations, and a new “conceptual” search paradigm based on a compact representation of semantic annotations called Con- ceptual Spectrum. The semantic platform discussed in this thesis is named H-DOSE (Holistic Distributed Semantic Elaboration Platform) and is currently available as an Open Source Project on Sourceforge: http://dose.sourceforge.net. H-DOSE has been entirely developed in Java for allowing better interoperability with already existing web systems and is currently deployed as a set of web services running on the Apache Tomcat servlet container. It is, at now, available in two different forms, one intended for micro enterprises, characterized by a small footprint on the server onto which is run, and one for small and medium enterprises that integrates the ability to distribute jobs on different machines, by means of agents, and that includes principles of autonomic computing for keeping the underlying knowledge base constantly up-to-date. Rather than being an isolated attempt to semantics integration in the current web, H-DOSE is still a very active project and is undergoing several improvements and refinements for better supporting the indexing and retrieval of non-textual information such as video clips, audio pieces, etc. There is also some ongoing work on the integration of H-DOSE into competitive intelligence systems as done by IntelliSemantic: a start-up of the Turin’s Polytechnic that builds its business plan on the adoption of semantic techniques, and in particular of the H-DOSE platform, for patent discovery services. Eventually, several side issues related to semantics handling and deployment on web applications have been addressed during the H-DOSE design, some of them will also be presented in this thesis. A newly designed ontology visualization tool based on multi-dimensional information spaces is an example. 1.4 Structure of the Thesis The remainder of this thesis is organized as follows: Chapter 2 introduces the vision of Semantic Web and discusses the data-model, standards, and technologies used to bring this vision into being. These building blocks are used in the design of H-DOSE trying to maximize the reuse of already available and well tested technologies thus avoiding to reinvent the wheel. Chapter 3 moves in parallel with the preceding chapter introducing an overview 6
  • 13. 1.4 – Structure of the Thesis of currently available web applications with a particular focus on systems for infor- mation management such as Content Management Systems, Indexing and retrieval systems, e-Learning systems. For every category of application, the points in which semantics can give substantial improvements either in effectiveness (performance) or in user experience are evidenced. Chapter 4 defines the requirements for the H-DOSE semantic platform, as they emerge from interviews with web actors such as content publishers, site administra- tors and so on. Chapter 5 introduces the H-DOSE logical architecture, and uses such architec- ture as a guide for discussing the basic principles and assumptions on to which the platform is built. For every innovative principle the strength points are evidenced together with the weaknesses emerged either during the presentations of such ele- ments in international conferences and workshops or during the H-DOSE design and development process. Chapter 6 describes in deep detail the H-DOSE platform, focusing on the role and the interactions that involve every single component of the platform. The main concern of this chapter is to provide a complete view of the platform, in its more specific aspects, discussing the adopted solutions from a “software engineering” point of view. Chapter 7 presents the case studies that constituted the benchmark of the H-DOSE platform. Each case study is addressed separately starting from a brief description of requirements and going through the integration design process, the deployment of the H-DOSE platform and the phase of results gathering and analysis. Chapter 8 is about the H-DOSE related tools developed during the platform design and implementation. They include a new ontology visualization tool and a genetic algorithm for semantic annotations refinement. Chapter 9 discusses the extension of H-DOSE principles and techniques to non-Web scenarios, with a particular focus on domotics. An ongoing project on semantics reach house gateways is described highlighting how the lessons learned in the design and development of H-DOSE can be applied in a complete different scenario, still retaining their valuability. Chapter 10 presents the related works in the field of both Semantic Web and Web Intelligence, with a particular focus on semantic platforms and semantics inte- gration on the Web. Chapter 11 eventually concludes the thesis and provides an overview on possible future works. 7
  • 14. Chapter 2 The Semantic Web vision This chapter introduces the vision of Semantic Web and discusses the data-model, standards, and technologies used to bring this vision into being. These building blocks are used in the design of H-DOSE trying to maximize the reuse of already available and well tested technologies thus avoiding to reinvent the wheel. The Semantic Web is developed layer by layer; the pragmatic justification for such a procedure is that it is easier to achieve consensus on small steps, whereas it is much harder to make everyone agree on very wide proposals. In fact there are many research groups that are exploring different and sometimes conflicting solutions. After all, competition is one of the major driving force for scientific development. Such a competition makes very hard to reach agreements on wide steps and often only a partial consensus can be achieved. The Semantic Web builds upon the steps for which consensus can be reached, instead of waiting to see which alternative research line will be successful in the end. The Semantic Web is such that companies, research groups and users must build tools, add content and use that content. It is certainly myopic to wait until the full vision will materialize: it may take another ten years to realize the full extent of SW, and many years more for the Wisdom Web. In evolving from one layer to another, two principles are usually followed: • Downward compatibility: applications, or agents, fully compliant with a layer shall also be aware of the lower layers, i.e., they shall be able to interpret and use information coming from those layers. As an example we can consider an application able to understand the OWL semantics. The same application shall also take full advantage of information encoded in RDF and RDF-S [4]. • Partial upward understanding: agents fully aware of a given layer should 8
  • 15. take, at least partial, advantage from information at higher levels. So, a RDF- aware agent should also be able to use information encoded in OWL [5], ig- noring those elements that go beyond RDF and RDF Schema. Figure 2.1. The Semantic Web ”cake”. The layered cake of the Semantic Web is shown in Figure 2.1 and describes the main components involved in the realization of the Semantic Web vision (due to Tim Berners Lee). At the bottom it is located XML (eXtensible Markup Language) a language for writing well structured documents according to a user-defined vocab- ulary. XML is a “de facto” standard for the exchange of information over the World Wide Web. On the top of XML builds up the RDF layer. RDF is a simple data model for writing statements about Web objects. RDF is not XML, however it has a XML-based syntax, so it is located, in the cake, over the XML layer. RDF-Schema defines the vocabulary used in RDF data models. It can be seen as a very primitive language for defining ontologies, as it provides the basic building blocks for organizing Web objects into hierarchies. Supported constructs include: classes and properties, the subClass and subProperty relations and the domain and range restrictions. RDF-Schema uses a RDF syntax. The Logic layer is used for further enhancing the ontology support offered by RDF-Schema, thus allowing to model application-specific declarative knowledge. The Proof layer, instead, involves the process of deductive reasoning as well as the process of providing and representing proofs in Web languages. Applications lying at the proof level shall be able to reason about the knowledge data defined in the lower layers and to provide conclusions together with “explanations” (proofs) about the deductive process leading to them. 9
  • 16. 2 – The Semantic Web vision The Trust layer, in the end, will emerge through the adoption of digital signatures and other kinds of knowledge, based on recommendations by trusted agents, by rating and certification agencies or, even, by consumer organizations. The expression “Web of Trust” means that the trust over the Web will be organized in the same distributed and sometimes chaotic way as the WWW itself. Trust is crucial for the final exploitation of the Semantic Web vision: until users will not have trust in its operations (security) and in quality of information provided (relevance) the SW will not reach its full potential. 2.1 Semantic Web Technologies The Semantic Web cake depicted above builds upon the so-called Semantic Web Technologies. These technologies empower the foundational components of the SW, which are introduced separately in the following subsections. 2.1.1 Explicit Metadata At now, the World Wide Web is mainly formatted for human users rather than for programs. Pages either static or dynamically built using information stored in databases are written in HTML or XHTML. A typical web page of an ICT consultancy agency can look like this: <html> <head></head> <body> <h1> SpiderNet internet consultancy, network applications and more </h1> <p> Welcome to the SpiderNet web site, we offer a wide variety of ICT services related to the net. <br/> Adam Jenkins, our graphics designer has designed many of the most famous web sites as you can see in <a href=’’gallery.html’’>the gallery</a>. Matt Kirkpatrick is our Java guru and is able to develop any new kind of functionalities you may need. <br> If you are seeking a great new opportunity for your business on the web, contact us at the following e-mails: <ul> <li>jenkins@spidernet.net</li> <li>kirkpatrick@spidernet.net</li> 10
  • 17. 2.1 – Semantic Web Technologies </ul> Or you may visit us in the following opening hours <ul> <li>Mon 11am - 7pm</li> <li>Tue 11am - 2pm</li> <li>Wed 11am - 2pm</li> <li>Thu 11am - 2pm</li> <li>Fri 2pm - 9pm</li> </ul> Please note that we are closed every weekend and every festivity. </p> </body> </html> For people, the provided information is presented in a rather satisfactory way, but for machines this document results nearly incomprehensible. Keyword-based techniques might be able to identify the words web site, graphics designer and Java. And an intelligent agent, could identify the email addresses and the personnel of the agency, and with a little bit of heuristics it might associate each employee with the correct e-mail address. But it will have troubles for distinguishing who is the graphics designer and who is the Java developer, and even more difficulties in capturing the opening hours (for which the agent would have to understand what festivities are celebrated during the year, and in which days, depending on the location of the agency, which in turn is not explicitly available in the web page.). The Semantic Web tries to address these issues not by developing super-intelligent agents able to understand information as humans. Instead it acts on the HTML side, trying to replace this language with more appropriate languages so that web pages could carry their content in a machine processable form, still remaining visually appealing for the users. In addition to formatting information for human users, these new web pages will also carry information about their content, such as: <company type=’’consultancy’’> <service>Web Consultancy</service> <products> Web pages, Web applications </products> <staff> <graphicsDesigner>Adam Jenkins</graphicsDesigner> <javaDeveloper>Matt Kirkpatrick</javaDeveloper> </staff> </company> 11
  • 18. 2 – The Semantic Web vision This representation is much easier for machines to understand and is usually known as metadata that means: data about data. Metadata encodes, in a sense, the meaning of data, so defining the semantics of a web document (thus the term Semantic Web). 2.1.2 Ontologies The term ontology stems from philosophy. In that context, it is used to name a subfield of philosophy, namely the study of the nature of existence (from the greek oντøλøγια), the branch of metaphysics concerned with identifying, in general terms, the kinds of things that actually exist, and how to describe them. For example the observation that the world is made up of specific entities that can be grouped in abstract classes based on shared properties is a typical ontological commitment. For what concerns nowadays technologies, ontology has been given a specific meaning that is quite different from the original one. For the purposes of this thesis the T.R. Gruber’s definition, later refined by R. Studer can be adopted: An ontology is an explicit and formal specification of a conceptualization. In other words, an ontology formally describes a knowledge domain. Typically, an ontology is composed of a finite list of terms and the relationships between these terms. The terms denote important concepts (classes of objects) of the domain. Relationships include, among the others, hierarchies of classes. A hierarchy specifies a class C to be a subclass of another class C if every object in C is also included in C . Apart from the subclass relationship (also known as “is A” relation), ontologies may include information such as: • properties (X makes Y ) • value restrictions (only smiths can make iron tools) • disjointness statements (teachers and secretary staff are disjoint) • specification of logical relationships between objects In the context of the web, ontologies provide a shared understanding of a domain. Such an understanding is necessary to overcome differences in terminology. As an example a web application may use the term “ZIP” for the same information that in another one is denoted as “area code”. Another problem is when two applications use the same term with different meanings. Such differences can be overcome by associating a particular terminology with a shared ontology, and/or by defining mappings between different ontologies. In both cases, it is easy to notice that ontologies support semantic interoperability. Ontologies are also useful for improving the results of Web searches. The search engine can look for pages that refer to a precise concept, or set of concepts, in 12
  • 19. 2.2 – Logic an ontology instead of collecting all pages in which certain, possibly ambiguous, keywords occur. In the same way as above, ontologies allow to overcome differ- ences in terminology between Web pages and queries. In addition, when performing ontology-based searches it is possible to exploit generalization and specialization information. If a query fails to find any relevant documents (or provides too many results), the search engine can suggest to the user a more general (specific) query [6]. It is even conceivable that the search engine runs such queries proactively, in order to reduce the reaction time in case the user adopts such suggestion. Ontologies can even be used to better organize Web sites and navigation of them. Many nowadays sites offer on the left-hand side of the pages the top levels of a concept hierarchy of terms. The user may click on them to expand the sub categories and to finally reach new pages in the same site. In the Semantic Web layered approach, ontologies are located in between the third layer of RDF and RDF-S and the fourth level of abstraction where the Web Ontology Language (OWL) resides. 2.2 Logic Logic is the discipline that studies the principles of reasoning; in general, it offers formal languages for expressing knowledge and well-understood formal semantics. Logic usually works with the so-called declarative knowledge, which describes what holds without caring about how it can be deduced. Deduction can be performed by automated reasoners: software entities that have been extensively studied in Artificial Intelligence. Logic deduction (inference) allows to transform implicit knowledge defined in a domain model (ontology) into explicit knowledge. For example, if a knowledge base contains the following axioms in pred- icate logic, human(X) → mammal(X) Ph.Dstudent(X) → human(X) Ph.Dstudent(Dario) an automated inferencing engine can easily deduce that human(Dario) mammal(Dario) Ph.Dstudent(X) → mammal(X) Logic can therefore be used to uncover ontological knowledge that is implicitly given and, by doing so, it can help revealing unexpected relationships and inconsistencies. 13
  • 20. 2 – The Semantic Web vision But logic is more general than ontologies and can also be used by agents for making decisions and selecting courses of action, for example. Generally there is a trade-off between expressive power and computational ef- ficiency. The more expressive a logic is, the more computationally expensive it becomes for drawing conclusions. And drawing conclusions can sometimes be impos- sible when non-computability barriers are encountered. Fortunately, a considerable part of the knowledge relevant to the Semantic Web seems to be of a relatively re- stricted form, and the required subset of logics is almost tractable, and is supported by efficient reasoning tools. Another important aspect of logic, especially in the context of the Semantic Web, is the ability to provide explanations (proofs) for the conclusions: the series of infer- ences can be retraced. Moreover, AI researchers have developed ways of presenting proofs in a human-friendly fashion, by organizing them as natural deductions and by grouping, in a single element, a number of small inference steps that a person would typically consider a single proof step. Explanations are important for the Semantic Web because they increase the users’ confidence in Semantic Web agents. Even Tim Berners Lee speaks of a “Oh yeah?” button that would ask for explanation. Of course, for logic to be useful on the Web, it must be usable in conjunction with other data, and it must be machine processable as well. From these requirements stem the nowadays research efforts on representing logical knowledge and proofs in Web languages. Initial approaches work at the XML level, but in the future rules and proofs will need to be represented at the level of ontology languages such as OWL. 2.2.1 Agents Agents are software entities that work autonomously and proactively. Conceptually they evolved out of the concepts of object-oriented programming and of component- based software development. According to the Tim Berners-Lee’s article [7], a Semantic Web agent shall be able to receive some tasks and preferences from the user, seek information from Web sources, communicate with other agents, compare information about user require- ments and preferences, select certain choices, and give answers back to the user. Agents will not replace human users on the Semantic Web, nor will they necessarily make decisions. In most cases their role will be to collect and organize information, and present choices for the users to select from. Semantic Web agents will make use of all the outlined technologies, in particular: • Metadata will be used to identify and extract information from Web Sources. 14
  • 21. 2.3 – Semantic Web Languages • Ontologies will be used to assist in Web searches, to interpret retrieved infor- mation, and to communicate with other agents. • Logic will be used for processing retrieved information and for drawing con- clusions 2.3 Semantic Web Languages 2.3.1 RDF and RDF-Schema RDF is essentially a data-model and its basic building block is a object-attribute- value triple, called statement. An example of statement is: Kimba is a Lion. This abstract data-model needs a concrete syntax to be represented and ex- changed and RDF has been given a XML syntax. As a result, it inherits the advantages of the XML language. However it is important to notice that other representations of RDF, not in XML syntax, are possible, N3 is an example. RDF is, by itself, domain independent: no assumptions on a particular domain of application are done. It is up to each user to define the terminology to be used in his/her RDF data-model using a schema language called RDF-Schema (RDF-S). RDF-Schema defines the terms that can be used in a RDF data-model. In RDF-S we can specify which objects exist and which properties can be applied to them, and what values they can take. We can also describe the relationships between objects so, for example, we can write: The lion is a carnivore. This sentence means that all the lions are carnivores. Clearly there is an intended meaning for the “is a” relation. It is not up to applications to interpret the “is a” term; its intended meaning shall be respected by all RDF processing softwares. By fixing the meaning of some elements, RDF-Schema enables developers to model specific knowledge domains. The principal elements of the RDF data-model are: resources, properties and statements. Resources are the object we want to talk about. Resources may be authors, cities, hotels, places, people, etc. Every resource is identified by a sort of identity ID, called URI. URI stands for Uniform Resource Identifier an provides means to uniquely identify a resource, be it available on the web or not. URIs do not imply the actual accessibility of a resource and therefore are suitable not only for web resources but also for printed books, phone numbers, people and so on. Properties are special kind of resources that describe relations between the objects of the RDF data-model, for example: “written by”, “eats”, “lives”, “title”, “color”, “age”, and so on. Properties are also identified by URIs. This choice allows, from one side to adopt a global, worldwide naming scheme, on the other side to write statements having a property either as subject or as object. URIs also allow to solve 15
  • 22. 2 – The Semantic Web vision the homonym problem that has been the plague of distributed data representation until now. Statements assert the properties of resources. They are object-attribute-value triples consisting respectively of a resource, a property and a value. Values can either be resources or literals. Literals are atomic values (string), that can have a specific XSD type, xsd:double as an example. A typical example of statement is: the H-DOSE website is hosted by www.sourceforge.net. This statement can be rewritten in a triple form: (‘‘H-DOSE web site’’,’’hosted by’’,’’www.sourceforge.net’’) and in RDF it can be modeled as: <?xml version=’’1.0’’ encoding=’’UTF-8’’?> <rdf:RDF xmlns:rdf=’’http://www.w3.org/1999/02/22-rdf-syntax-ns#’’ xmlns:mydomain=’’http://www.mydomain.net/my-rdf-ns’’> <rdf:Description about=’’http://dose.sourceforge.net’’> <mydomain:hostedBy> http://www.sourceforge.net </mydomain:hostedBy> </rdf:Description> </rdf:RDF> One of the major strength points of RDF is the so-called reification: in RDF it is possible to make statements about statements, such as: Mike thinks that Joy has stolen its diary This kind of statement allows to model belief or trust on other statements, which is important for some kinds of application. In addition, reification allows to model non-binary relations using triples. The key idea, since RDF only supports triples, i.e., binary relationships, is to introduce an auxiliary object and relate it to each of the parts of the non-binary relation through the properties subject, predicate and object. So, for example, if we want to represent the tertiary relationship referee(X,Y,Z) having the following well defined meaning: X is the referee in a tennis game between players X and Y. 16
  • 23. 2.3 – Semantic Web Languages Figure 2.2. Representation of a tertiary predicate. we have to break it in three binary relations adding an auxiliary resource called tennisGame, as in Figure 2.2. RDF critical view As already cited RDF only uses binary properties. This restriction could be a quite limiting factor since usually we adopt predicates with more than two arguments. Fortunately, reification allows to overcome this issue. However some critic aspects arise from the adoption of the reification mechanism. As first, although the solution is sound, the problem remains that not binary predicates are more natural with more arguments. Secondly, reification is a quite complex and powerful technique which may appear misplaced for a basic layer of the Semantic Web, instead it would have appeared more natural to include it in more powerful layers that provide richer representational capabilities. In addition, the XML syntax of RDF is quite verbose and can easily become too cumbersome to be managed directly by users, especially for huge data-models. So comes the adoption of user-friendly tools that automatically translate higher level representations into RDF. Eventually, RDF is a standard format therefore the benefits of drafting data in RDF can be seen as similar to drafting information in HTML in the early days of the Web. From RDF/RDF-S to OWL The expressiveness of RDF and RDF-Schema (described above) is very limited (and this is a deliberate choice): RDF is roughly limited to model binary relationships and RDF-S is limited to sub-class hierarchies and property hierarchies, with restrictions on the domain and range of the lasts. 17
  • 24. 2 – The Semantic Web vision However a number of research groups have identified different characteristic use- cases for the Semantic Web that would require much more expressiveness than RDF and RDF-S offer. Initiatives from both Europe and United States came up with proposals for richer languages, respectively named OIL and DAML-ONT, whose merging DAML+OIL was taken by the W3C as the starting point for the Web Ontology Language OWL. Ontology languages must allow users to write explicit, formal conceptualizations of domain knowledge, the main requirements are therefore: • a well defined syntax, • a formal semantics, • an efficient reasoning support, • a sufficient expressive power, • a convenience of expression. The importance of a well-defined syntax is clear, and known from the area of pro- gramming languages: it is a necessary condition for “machine understandability” and thus for machine processing of information. Both RDF/RDF-S and OWL have this kind of syntax. A formal semantics allows to describe the meaning of knowledge precisely. Precisely means that semantics does not refer to subjective intuitions and is not open to different interpretations by different people (or different machines). The importance of a formal semantics is well known, for example, in the domain of mathematical logic. Formal semantics is needed for allowing people to reason about knowledge. This, for ontologies, means that we may reason about: • Class membership. If x is an instance of a class C, and C is a subclass of D, we can infer that x is also an instance of D. • Equivalence of classes. If a class A is equivalent to a class B, and B is equiv- alent to C, then A is equivalent to C, too. • Consistency. Let x be an instance of A, and suppose that A is a subclass of B ∩ C and of D. Now suppose that B and D are disjoint. There is a clear inconsistence in our model because A should be empty but has the instance x. Inconsistencies like this indicate errors in the ontology definition. • Classification. If we have declared that certain property-value pairs are suffi- cient conditions for membership in a class A, then if an individual (instance) x satisfies such conditions, we can conclude that x must be an instance of A. 18
  • 25. 2.3 – Semantic Web Languages Semantics is a prerequisite for reasoning support. Derivation such as the preced- ing ones can be made by machines instead of being made by hand. Reasoning is important because allows to: • check the consistency of the ontology and of the knowledge model, • check for unintended relationships between classes, • automatically classify instances. Automatic reasoning allows to check much more cases than could be checked man- ually. Such checks become critical when developing large ontologies, where multiple authors are involved, as well as when integrating and sharing ontologies from various sources. Formal semantics is obtained by defining an explicit mapping between an ontol- ogy language and a known logic formalism, and by using automated reasoners that already exist for that formalism. OWL, for instance, is (partially) mapped on de- scription logic, and makes use of existing reasoners such as Fact, Pellet and RACER. Description logics are a subset of predicate logic for which efficient reasoning support is possible. 2.3.2 OWL languages The full set of requirements for an ontology language are: efficient reasoning support and convenience of expression, for a language as powerful as the combination of RDF-Schema with full logic. These requirements have been the main motivation for the W3C Ontology Group to split OWL in three different sub languages, each targeted at different aspects of the full set of requirements. OWL Full The entire Web Ontology Language is called OWL Full and uses all the OWL lan- guages primitives. It allows, in addition, the combination of these primitives in arbitrary ways with RDF and RDF-Schema. This includes the possibility (already present in RDF) of changing the meaning of the predefined (RDF and OWL) prim- itives, by applying the language primitives to each other. For example, in OWL full it is possible to impose a cardinality constraint on the Class of all classes, essentially limiting the number of classes that can be described in an ontology. The advantage of OWL Full is that it is fully upward-compatible with RDF, both syntactically and semantically: any legal RDF document is also a legal OWL Full document, and any valid RDF/RDF-S conclusion is also a valid OWL Full conclusion. The disadvantage of the OWL Full is that the language has become 19
  • 26. 2 – The Semantic Web vision so powerful as to be undecidable, so dashing any hope of complete (or efficient) reasoning support. OWL DL In order to re-obtain computational efficiency, OWL DL (DL stands for Description Logic) is a sub language of OWL Full that restricts how the constructors from RDF and OWL may be used: application of OWL’s constructors to each other is prohibited so as to ensure that the language corresponds to a well studied description logic. The advantage of this is that it permits efficient reasoning support. The dis- advantage is the lost of full compatibility with RDF: an RDF document will, in general, have to be extended in some ways and restricted in others before becoming a legal OWL DL document. Every OWL DL document is, in turn, a legal RDF document. OWL Lite An even further restriction limits OWL DL to a subset of the language constructors. For example, OWL Lite excludes enumerated classes, disjointness statements, and arbitrary cardinality. The advantage of this is a language that is both easier to grasp for users and easier to implement for developers. The disadvantage is of course a restricted expressivity. 2.3.3 OWL in a nutshell Header OWL documents are usually called OWL ontologies and they are RDF documents. The root element of an ontology is an rdf:RDF element, which specifies a number of namespaces: <rdf:RDF xmlns:owl = ’http://www.w3.org/2002/07/owl#’’ xmlns:rdf = ’’http://www.w3.org/1999/02/22-rdf-syntax-ns#’’ xmlns:rdfs = ’’http://www.w3.org/2000/01/rdf-schema#’’ xmlns:xsd = ’’http://www.w3.org/2001/XMLSchema#’’> An OWL ontology can start with a set of assertions for house keeping purpose. These assertions are grouped under an owl:Ontology element, which contains comments, version control, and inclusion of other ontologies. 20
  • 27. 2.3 – Semantic Web Languages <owl:Ontology rdf:about=’’ ’’> <rdfs:comment>A simple OWL ontology </rdfs:comment> <owl:priorVersion rdf:resource=’’http://www.domain.net/ontologyold’’/> <owl:imports rdf:resource=’’http://www.domain2.org/savanna’’/> <rdfs:label>Africa animals ontology</rdfs:label> </owl:Ontology> The most important of the above assertions is the owl:imports, which lists other ontologies whose content is assumed to be part of the current ontology. It is impor- tant to be aware that the owl:imports is a transitive property: if the ontology A imports the ontology B, and the ontology B imports the ontology C, then A is also importing C. Classes Classes are defined using the owl:Class element and can be organized in hierarchies by means of the rdfs:subClassOf construct. <owl:Class rdf:ID=’’Lion’’> <rdfs:subClassOf rdf:resource=’’#Carnivore’’/> </owl:Class> It is also possible to indicate that two classes are completely disjoint such as the herbivores and the carnivores, using the owl:disjointWith construct. <owl:Class rdf:about=’’#carnivore’’> <owl:disjointWith rdf:resource=’’#herbivore’’/> <owl:disjointWith rdf:resource=’’#omnivore’’/> </owl:Class> Equivalence of classes may be defined using the owl:equivalentClass element. Eventually there are two predefined classes, owl:Thing and owl:Nothing, which, respectively, indicate the most general class containing everything in a OWL doc- ument, and the empty class. As a consequence, every owl:Class is a subclass of owl:Thing and a superclass of owl:Nothing. Properties In OWL are defined two kinds of properties: 21
  • 28. 2 – The Semantic Web vision • Object properties, which relate objects to other objects. Example is, in the savanna ontology, the relation eats. • Datatype properties, which relate objects with datatype values. Examples are age, name, and so on. OWL has not any predefined data types, nor does it provide special definition facilities. Instead, it allows the use of XML-Schema datatypes, making use of the layered architecture of the Semantic Web. Here there are two examples, the first for a Datatype property while the second is for Object properties: <owl:DatatypeProperty rdf:ID=’’age’’> <rdfs:range rdf:resource=’’&xsd;#nonNegativeInteger’’/> </owl:DatatypeProperty> <owl:ObjectProperty rdf:ID=’’eats’’> <rdfs:domain rdf:resource=’’#animal’’/> </owl:ObjectProperty> More than one domain and range can be declared, in such a case the intersection of the domains (ranges) is taken. OWL allows to identify “inverse properties”, for them a specific owl element exists (owl:inverseOf) and has the effect of relating a property with its inverse by inter-changing the domain and range definitions. <owl:ObjectProperty rdf:ID=’’eatenBy’’> <owl:inverseOf rdf:resource=’’#eats’’/> </owl:ObjectProperty> Eventually equivalence of properties can be defined through the use of the element owl:equivalentProperty. Restrictions on properties In RDFS it is possible to declare a class C as a subclass of a class C , then every instance of C will be also an instance of C . OWL allows to specify classes C that satisfy some precise conditions, i.e., all instances of C satisfy the conditions. This is done by defining C as a subclass of the class C which collects all the objects that satisfy the conditions. In general, C remains anonymous. In OWL there are three specific elements for defining classes basing on restrictions, they are owl:allValuesFrom, owl:someValuesFrom and owl:hasValue, and they are always nested into a owl:Restriction element. The owl:allValuesFrom specify a universal quantification (∀) while the owl:someValuesFrom defines and existential quantification (∃). 22
  • 29. 2.3 – Semantic Web Languages <owl:Class rdf:about=’’#firstYearCourse’’> <rdfs:subClassOf> <owl:Restriction> <owl:onProperty rdf:resource=’’#isTaughtBy’’/> <owl:allValuesFrom rdf:resource=’’#Professor’’/> </owl:Restriction> </rdfs:subClassOf> </owl:Class> This example requires every person who teaches an instance of “firstYearCourse”, e.g., a first year subject, to be a professor (universal quantification). <owl:Class rdf:about=’’#academicStaffMember’’> <rdfs:subClassOf> <owl:Restriction> <owl:onProperty rdf:resource=’’#teaches’’/> <owl:someValuesFrom rdf:resource=’’#undergraduateCourse’’/> </owl:Restriction> </rdfs:subClassOf> </owl:Class> This second example, instead, requires that there exist an undergraduate course taught by an instance of the class of academic staff members (existential quantifi- cation). In general an owl:Restriction element contains an owl:onProperty element and one ore more restriction declarations. Restrictions for defining the cardinality of a given class are also supported through the elements: • owl:minCardinality, • owl:maxCardinality, • owl:Cardinality. The latter is a shortcut for a cardinality definition in which owl:minCardinality and owl:maxCardinality assume the same value. Special properties Some properties of the property element can be defined directly: 23
  • 30. 2 – The Semantic Web vision • owl:TransitiveProperty defines a transitive property, such as “has better grade than”, “is older than”, etc. • owl:SymmetricProperty defines a symmetric property, such as “has same grade as” or “is sibling of”. • owl:FunctionalProperty defines a property that has at most one value for each object, such as “age”, “height”, “directSupervisor”, etc. • owl:InverseFunctionalProperty defines a property for which two different objects cannot have the same value, for example “is identity ID for”. Instances Instances of classes, in OWL, are declared as in RDF: <rdf:Description rdf:ID=’’Kimba’’> <rdf:type rdf:resource=’’#Lion’’/> </rdf:Description> OWL, unlike typical database systems, does not adopt a unique-names assump- tion therefore two instances that have different names are not required to be actually two different individuals. Then, to ensure that different individuals are recognized by automated reasoners as such, inequality must be explicitly asserted. <lecturer rdf:ID=’’91145’’> <owl:differentFrom rdf:resource=’’#98760’’/> </lecturer> Because such inequality statements frequently occur, and the required number of statements would explode for stating the inequality of a large number of individual, OWL provides a shorthand notation to assert the pairwise inequality for all the individuals in a list: owl:AllDifferent. <owl:AllDifferent> <owl:distinctMembers rdf:parseType=’’Collection’’> <lecturer rdf:about=’’#91345’’/> <lecturer rdf:about=’’#91247’’/> <lecturer rdf:about=’’#95647’’/> <lecturer rdf:about=’’#98920’’/> </owl:distinctMembers> </owl:AllDifferent> Note that owl:distinctMembers can only be used in combination with the owl:AllDifferent element. 24
  • 31. Chapter 3 Web applications for Information Management This chapter introduces an overview of currently available web appli- cations with a particular focus on systems for information management such as Content Management Systems, Indexing and retrieval systems, e- Learning systems. For every category of applications the points in which semantics can give substantial improvements either in effectiveness (per- formance) or in user experience are evidenced. Many businesses and activities, either on the web or not, are human and knowledge intensive. Examples include consulting, advertising, media, high-tech, pharmaceuti- cal, law, software development, etc. Knowledge intensive organizations have already found that a large number of problems can be attributed to un-captured and un- shared product and process knowledge, as well as to lacks of “who knows what” information, to the need to capture lessons learned and best practices, and to the need of more effective distance collaboration. These realizations are leading to a growing call for knowledge management. Knowledge capture and learning can happen ad-hoc (e.g. knowledge sharing around the coffee maker or problem discussions around the water cooler). Sharing, however, is more efficient when organized; moreover, knowledge must be captured, stored and organized according to the context of each company, or organization, in order to be useful and efficiently disseminated. The knowledge items that an organization usually needs to manage can have different forms and contents. They include manuals, correspondence with vendors and customers, news, competitors intelligence, and knowledge derived from work processes (e.g. documentation, proposals, project plans, etc.), possibly in different formats (text, pictures, video). The amount of information and knowledge that a modern organization must capture, store and share, the geographic distribution of 25
  • 32. 3 – Web applications for Information Management sources and consumers, and the dynamic evolution of information makes the use of technology nearly mandatory. 3.1 The knowledge sharing and life cycle model One of the most adopted knowledge sharing model has been developed by Nonaka and Takenuchi in 1995 [8] and is called the “tacit-explicit model” (see Figure 3.1). Referring to this model, tacit knowledge is knowledge that rests with the employee, Figure 3.1. The Tacit-Explicit Model. and explicit knowledge is knowledge that resides in the knowledge base. Conversion of knowledge from one form to another leads often to the creation of new knowledge; such conversion may follow four different patterns: • Explicit-to-explicit knowledge conversion, or “Combination”, is the reconfigu- ration of explicit knowledge through sorting, adding, combining and catego- rizing. Often this process leads to new knowledge discovery. • Explicit-to-tacit knowledge conversion, or “Internalization”, takes place when one assimilates knowledge acquired from knowledge items. This internalization contributes to the user’s tacit knowledge and helps him/her in making future decisions. 26
• Tacit-to-explicit knowledge conversion, or "Externalization", involves transforming context-based facts into context-free knowledge, with the help of analogies. Tacit knowledge is usually personal and depends on the person's experiences in various conditions. As a consequence, it is a strongly contextualized resource. Once made explicit, it will not retain much value unless context information is somehow preserved. Externalization can take two forms: recorded or unrecorded knowledge.

• Tacit-to-tacit knowledge conversion, or "Socialization", occurs by sharing experiences, by working together in a team and, more generally, by direct exchange of knowledge. Knowledge exchange at places where people socialize, such as around the coffee maker or the water cooler, leads to tacit-to-tacit conversion.

The "knowledge life cycle" follows the path of knowledge creation/acquisition, knowledge organization and storage, knowledge distribution, application and reuse, and finally ends up in creation/acquisition again, in a sort of spiral-shaped process (Figure 3.1). Tacit knowledge has to be made explicit in order to be captured and made available to all the actors of a given organization. This is accomplished with the aid of knowledge acquisition or knowledge creation tools. Knowledge acquisition builds and evolves the knowledge bases of organizations.

Knowledge organization/storage takes place through activities by which knowledge is organized, classified and stored in repositories. Explicit knowledge needs to be organized and indexed for easy browsing and retrieval. It must be stored efficiently to minimize the required storage space.

Knowledge can be distributed through various channels such as training programs, automatic knowledge distribution systems and knowledge-based expert systems. Regardless of the way knowledge is distributed, making the knowledge base of a given organization available to its users, i.e., delivering the right information at the right place and time, is one of the most critical assets for a modern company.

3.2 Software tools for knowledge management

Knowledge management (KM) must be supported by a collection of technologies for authoring, indexing, classifying, storing, contextualizing and retrieving information, as well as for collaboration and the application of knowledge. A friendly front-end and a robust back-end are the basis for KM software tools. The elements involved include, among others, content management systems for efficiently publishing and sharing knowledge and data sources, indexing, classification and retrieval systems to ease
access to information stored in knowledge bases, and e-Learning systems for allowing users to perform "Internalization", possibly in a personalized way. These technologies are still far from being completely semantic, i.e., based on context- and domain-aware elements able to fully support knowledge operations at a higher level, similar to what humans usually do, while at the same time remaining accessible to machines.

The following subsections take as a reference three widely adopted technologies, namely content management systems (CMS), search engines and e-Learning systems, and for each of them discuss the basic principles and "how and where" semantics can improve their performance. Performance is evaluated both from the functional point of view and from the knowledge transfer point of view. For each technology, a separate subsection is therefore provided with a brief introduction, a discussion of currently available solutions and some considerations about the integration of semantic functionalities. Finally, the identified shortcomings and solutions will define the requirements for a general-purpose platform able to provide semantics integration on the web with minimal effort.

3.2.1 Content Management Systems (CMS)

In terms of knowledge management, the documents that an organization produces and publishes represent its explicit knowledge. New knowledge can be created by efficiently managing document production and classification: for example, de facto experts can be identified based on the authorship of documents. Document and, more generally, content management systems (CMS) enable explicit-to-explicit knowledge conversion.

A CMS is, in simple terms, software designed to manage a complete web site and the related information. It keeps track of changes, by recording who changed what and when, and can allow notes to be added to each managed resource. A writer can create a new page (with standard navigation bars and images on each page) and submit it, without using HTML-specific software. In the same way, an editor can receive all the proposed changes and approve them, or he/she can send them back to the writer to be corrected. Many other functionalities are supported, such as sending e-mails, contacting other users of the CMS, setting up forums or chats, etc. In a sense, a CMS is the central point for "presentation independent" exchange of information on the web and/or on organization intranets.

The separation between presentation and creation of published resources is, indeed, one of the added values of CMS adoption. A CMS allows editors to concentrate their efforts on content production while the graphical presentation issues are handled by the system itself, so that the resulting web site is both coherent and homogeneously organized. The same separation makes it possible to better organize the web site navigation and to define the cognitive path that users are expected to follow
when browsing the available pages.

CMS common features

Almost all CMSs help organizations to achieve the following goals:

• Streamline and automate content administration. Historically, Web content has consisted of static HTML pages/files, requiring HTML programming experience and manual updating of content and design: clearly a time-consuming and labor-intensive process. In contrast, CMSs significantly reduce this overhead by hiding the complexities of HTML and by automating the management of content.

• Implement Web-forms-based content administration. In an ideal CMS, all content administration is performed through Web forms using a Web browser. Proprietary software and specialized expertise (such as HTML) are not required for content managers. Users simply copy and paste existing content or fill in the blanks on a form.

• Distribute content management and control. The Web manager has often been a critical bottleneck in the timely publication and ongoing maintenance of Web content. CMSs remove that bottleneck by distributing content management responsibilities to individuals throughout the organization. The individuals who are responsible for content now have the authority and tools to maintain the content themselves, without any knowledge of HTML, graphic design, or Web publishing.

• Separate content from layout and design. In a CMS, content is stored separately from its publication format. Content managers enter the content only once, but it can appear in many different places, formatted using very different layouts and graphic designs. All the pages immediately reflect approved content changes.

• Create reusable content repositories. CMSs allow for reuse of content. Objects such as templates, graphics, images, and content are created and entered once and then reused as needed throughout the Web site.

• Implement central graphic design management. Graphic design in a CMS becomes template-driven and centrally managed. Templates are the structures that format and display content following a user's request for a particular Web page. Templates ensure a consistent, professional look and feel for all content on the site. They also allow for (relatively) easy and simultaneous modification of an entire site's graphic design.
• Automate workflow management. Good CMSs enable good workflow processes. In the most complex workflow systems, at least three different individuals create, approve, and publish a piece of content, working separately and independently. A good workflow system expedites the timely publication of content by alerting the next person in the chain when an action is required. It also ensures that content is adequately reviewed and approved before publication.

• Build sophisticated content access and security. Good CMSs allow for sophisticated control of content access, both for content managers who create and maintain content and for users who view and use it. Web managers should be able to define who has access to different types of information and what type of access each person has.

• Make content administration database-driven. In a CMS, static, flat HTML pages no longer exist. Instead, the system places most content in a relational database capable of storing a variety of text and binary materials. The database then becomes the central repository for contents, templates, graphics, users and metadata.

• Include structures to collect and store metadata. Because data is stored separately from both layout and design, the database also stores metadata describing and defining the data, usually including author, creation date, publication and expiration dates, content descriptions and indexing information, category information, revision history, security and access information, and a range of other content-related data.

• Allow for customization and integration with legacy systems. CMSs allow customization of the site functionality through programming. They can expose their functionalities through an application programming interface (API) and they can coexist with, and integrate, already deployed legacy systems.

• Allow for archiving and version control. High-end CMSs usually provide mechanisms for storing and managing revisions to content. As changes are made, the system stores archives of the content and allows any page to be reverted to an earlier version. The system also provides means for pruning the archived content periodically, preferably on the basis of criteria including age, location, number of versions, etc.

Logical architecture

A typical CMS is organized as in Figure 3.2.
Figure 3.2. A typical CMS architecture.

It is composed of five macro-components: the Editing front-end, the Site organization module, the Review system, the Theme management module and the Publication system.

The Editing front-end is the starting point of the publishing chain implemented by the CMS. This component usually allows "journalists", i.e., the people who produce new content, to submit their writings. After submission, the new content is adapted for efficient storage and, if a classification module exists, it is indexed (either automatically or manually) in order to later allow effective retrieval of the stored resources.

The Review system constitutes the other side of the submission process, and it is designed to support the workflow of successive reviews that occur between the first submission of a document and its final publication. The users allowed to interact with this component are usually more expert users, who have the ability/authority to review the contents submitted by journalists. They can approve the submitted data or they can send it back to the journalists for further modifications. The approval process is deployed differently depending both on the CMS class (high-end, middle-level or entry-level) and on the redaction paradigm adopted, based either on a single review or on multiple reviews.

The Site organization module provides the interface for organizing the published information into a coherent and possibly usable web site. This module is specifically targeted at the definition of the navigation patterns to be proposed to the final users, i.e., at the definition of the site map. Depending on the class of the CMS system, the
site organization is either designed by journalists, or it is proposed by journalists and subsequently reviewed by editors, possibly in a fully iterative cycle. Since many conflicting site maps may arise, even with few journalists, current content management systems usually adopt the latter publication paradigm.

The Theme management component is in charge of the complete definition of the site appearance. Graphic designers interact with this module by loading pictures and graphical decorations, by defining publication styles, by creating dynamic and interactive elements (menus, for example), etc. The graphical aspect of a site is a critical asset both for the site's success (for sites published on the Web) and for the site's effectiveness in terms of user experience (usability and accessibility issues); therefore, care must be paid when developing themes and decorations. Usually, theme creation is not moderated by the CMS. In addition, many of the currently available systems do not allow on-line editing of graphical presentations. Instead, each designer has to develop his/her own theme (or part of a theme) and upload it to the CMS as a whole element. The final publication is subject to editor approval; however, there are usually no means for an editor to set up an iterative refinement cycle similar to the document publication process.

The Publication system is both the most customizable and the most visible component of a CMS. Its main duty is to pick up and publish resources from the CMS document base. Only approved documents can be published, while resources under review should remain invisible to the users (except to journalists and editors). The publication system adopts the site map defined by means of the organization module, and stored in the CMS database, to organize the information to be published. Links between resources can either be defined at publication time or be automatically derived at runtime according to some (possibly complex) rules. The graphical presentation of pages depends on the theme defined by the graphic designers and approved by the editors. Depending on company needs and mission, pages can turn out to be completely accessible, even to people with disabilities (this is mandatory for web sites providing services to people, and is desirable for every site on the Web), or not accessible at all.

In addition to the publication of documents edited by the journalists, the system can offer final viewers many more services, depending on the supported/installed sub-modules. For example, a typical CMS is able to offer multiple topic-centered forums, mailing lists, instant messaging, whiteboards between on-line users, cooperation systems such as Wikis or Blogs, etc.

Semantics integration

As shown in the preceding paragraphs, CMSs are, at the same time, effective and critical components for knowledge exploitation, especially for
explicit-to-explicit conversion (Combination) and explicit-to-tacit conversion (Internalization). They often offer some kind of metadata-management functions, making it possible to keep track of the authors of published data, of the creation, publication and expiration dates of documents, and of information for indexing and categorizing the document base. This information is only semantically consistent with the internal CMS database, i.e., it roughly corresponds to the fields of the CMS database. As shown by many years of research in database systems, this is actually a form of semantics; however, it is related neither to external resources nor to explicitly available models. Stored meta-information, although meaningful inside the CMS-related applications, will thus not be understandable by external applications, making the whole system less interoperable.

A Semantic Web system, as well as a future Wisdom Web system, instead relates its internal knowledge to well-known models where possible, such as the Dublin Core for the authorship of documents. Even when sufficiently detailed models are not available and must be developed from scratch, the way metadata is formalized follows a well-defined standard and is understood by all "semantics-aware" software and architectures. Given that every Semantic Web application must manage at least RDF/S and the related semantics, interoperability is automatically granted.

Indexing and searching the CMS base can also take advantage of semantic information. As for metadata (which is in turn strongly related to indexing and retrieval), current systems already provide several advanced facilities for allowing users to store and retrieve their data in an effective way. Unfortunately, the adopted technologies are essentially keyword-based. A keyword-based information retrieval sub-system relies on the occurrence of specific words (called keywords) inside the indexed documents. The most general words, such as articles, conjunctions and the like, are usually discarded as they are not useful for distinguishing documents, while the remaining terms are collected and inversely related to the resources in the CMS base. So, in the end, for each word a list of the documents in which the word occurs is compiled. Whenever a user performs a query, the query terms are matched against the term list stored by the retrieval subsystem and the corresponding resources are retrieved according to properly defined ranking mechanisms. Beyond the accuracy and efficiency of the different systems available today, they all share a common and still problematic issue: they are vocabulary dependent. Being based on keywords found in the managed document base, these engines are not able to retrieve anything if the same keywords do not occur both in the user query and in the stored documents.

A more extensive discussion of these topics is available in the Information Retrieval systems sub-section; however, it is easy to notice that semantics integration can alleviate, if not completely address, these issues by abstracting both queries and document descriptions into "semantic models". The matching operation is, in this case, performed at the level of conceptual models and, if the conversion of
either the query or the document content into the corresponding conceptual model has been performed effectively, matches can be found independently of vocabularies. This interesting capability of finding resources independently of vocabularies can be leveraged by "Semantic Web" CMSs to offer much more advanced functionalities than are available today. Language independence, for example, can be easily achieved: a user can write a query in a given language, which will likely be his/her mother tongue, and can ask the system to retrieve data both in the query language and in other languages that he/she can understand. Matching queries and documents at the conceptual level makes this process fairly easy. The currently available systems, instead, can usually provide results only in the same language as the query (in the rather fortunate case in which the query language corresponds to one of the languages adopted for keyword-based indexing).

Other retrieval-related functionalities include the contextual retrieval of "semantically related" pages during user navigation. When a user requests a given page of the site managed by a semantic CMS, the page's conceptual description is picked up and used to retrieve links to pages of the site that have conceptual descriptions similar to that of the requested page. This makes it possible, for example, to browse the published site by similarity of pages rather than by following the predefined interaction scenario fixed by the site map.

The impact of semantics on CMS technology is not limited to the storage, classification and retrieval of resources. Semantics can also be extremely useful in defining the organization of published sites and the expected workflow for resources produced and reviewed within the CMS environment. There are several attempts to provide the first CMS implementations in which content is automatically organized and published according to a given ontology. In the same way, ontologies are used for defining the complex interactions that characterize the document review process and the steps required for a document to be published by the CMS.

In conclusion, the introduction of semantics handling in content management systems can provide several advantages, both in terms of document storage and retrieval and in terms of site navigation and the site publication workflow.

3.2.2 Information Retrieval systems

Information Retrieval has been one of the most active research streams of the past decade. It still permeates almost all web-related applications, providing means, methodologies or techniques for easily accessing resources, be they human-understandable documents, database records or other data. Information retrieval deals with the problem of storing, classifying and effectively retrieving resources, i.e., information, in computer systems. The find utility in Unix and Microsoft's small search-assistant
dog are very simple examples of information retrieval systems. More sophisticated and certainly more widespread examples are also available, Google [9] above all.

A simple information retrieval system is built on the concepts of document indexing, classification and retrieval. These three processes are at the basis of every computer-based search system, and for each of them several techniques have been studied, ranging from early heuristic-based approaches to today's statistical and probabilistic methods. The logical architecture of a typical information retrieval system is shown in Figure 3.3.

Figure 3.3. The Logical Architecture of a typical Information Retrieval system.

Several blocks can be identified. The Text Operations block performs all the operations required to adapt the text of documents to the indexing process; for example, in this block stop words are removed and the remaining words are stemmed. The Indexing block constructs an inverted index of word-to-document pointers. The Searching block retrieves from the inverted index all the documents that contain a given query token. The Ranking block, instead, ranks all the retrieved documents according to a similarity measure which evaluates how similar documents are to queries. The User Interface allows users to issue queries and view results; sometimes it also supports relevance feedback, which allows users to improve the search performance of the IR system by explicitly stating which results are relevant and which are not. Finally, the Query Operations block transforms the user query to improve the IR system performance: for example, a standard thesaurus can be used to expand the user query with new relevant terms, or the query can be transformed by taking into account users' suggestions coming from relevance feedback.

Describing in detail a significant part of all the available approaches to information retrieval and their variants would require more than one thesis alone. In addition,
the scope of this section is not to be exhaustive with respect to available technologies, solutions, etc. Instead, the main goal is to provide a rough description of how an information retrieval system works and a glimpse of the advantages that the adoption of semantics can bring to Information Retrieval. For readers more interested in this topic, the bibliography section reports several interesting works that can constitute a good starting point for investigation; of course, the web is the most viable means for gathering further resources. For the sake of simplicity, this section proceeds by adopting the tf·idf weighting scheme and the Vector Space model as the guiding methodology, and tries to generalize the provided considerations whenever possible.

Indexing

In the indexing process, each searchable resource is analyzed in order to extract a suitable description. This description will in turn be used by the classification process and by the retrieval process. For now, we restrict the description of indexing to text-based documents, i.e., documents which mainly contain human-understandable terms. In this case, indexing intuitively means taking into account, in some way, the information conveyed by the words contained in the document to be indexed. Since humans can understand textual documents, information is indeed contained in them, in a somewhat encoded form. The goal of indexing is to extract this information and to store it in a machine-processable form. Two main approaches are usually adopted to perform this extraction: the first tries to mimic what humans do and leads to the wide and complex field of Natural Language Processing; the second, instead, uses information which is much easier for machines to handle, such as the statistical correlation between co-occurring words, term frequency and so on. The latter solution is the one adopted by today's retrieval systems, while the former finds its application only in more restricted search domains where specific and rather well-defined sub-languages can be found.

The tf·idf indexing scheme is a typical example of "machine level" resource indexing. The basic assumption of tf·idf, and of other more sophisticated methods, is that the information in textual resources is encoded in the adopted terms: the more specific a term is, the more easily the topic of a document can be inferred. The main indexing operations therefore deal with the words occurring in the resources being analyzed, trying to extract only the relevant information and to discard all the redundancies typical of written language. In the tf·idf case, the approach works by inspecting the document terms. First, all the words that usually convey little or no information, such as conjunctions, articles, adverbs, etc., are removed. These are the so-called stop words and typically depend on the language in which the given document is written. Removing the stop words makes it possible to adopt frequency-based methods without the data being polluted
by non-significant information occurring uniformly in all the documents.

Once documents have been purged of stop words, the tf·idf method evaluates the frequency of each term occurring in the document, i.e., the number of times the word occurs inside the document, with respect to the most frequent term. In the simplest implementation of tf·idf, a vocabulary L defines the words for which this operation has to be performed. Let t_i be the i-th term of the vocabulary L, and let freq(t_i, d) be the number of occurrences of t_i in the document d. The term frequency tf_i of the term t_i in the document d is defined as:

tf_i(d) = \frac{freq(t_i, d)}{\max_{t_j \in d} freq(t_j, d)}

The term frequency alone is clearly too simplistic a feature for characterizing a textual resource. Term frequency, in fact, is only a relative measure of how important (statistically speaking) a word is within a document; no information is provided on the ability of the given word to discriminate the analyzed document from the others. Therefore, a weighting scheme must be adopted which takes into account the frequency with which the same term occurs across the document base. This weighting scheme is materialized by the inverse document frequency term idf. The inverse document frequency takes into account the relative frequency of the term t_i with respect to the documents already indexed. Formally:

idf_{t_i} = \log \frac{|D|}{|\{ d_k \in D : tf_i(d_k) > 0 \}|}

The two values, i.e., tf and idf, are combined into a single value called tf·idf, which describes the ability of the term t_i to discriminate the document d from the others:

tf{\cdot}idf_{t_i}(d) = tf_i(d) \cdot idf_{t_i}

The set of tf·idf values, one for each term of the vocabulary L, defines the representation of a document d inside the Information Retrieval system. It must be noted that this indexing process is strongly vocabulary dependent: words not occurring in L are not recognized, and if they are used in queries they do not lead to any result. The same holds for more complex methods where L is built from the indexed documents or from a training set of documents: words not occurring in the set of analyzed documents are not taken into account. So, for example, if in a set of textual resources only the word horse occurs, a query for stallion will not provide results, even if horse and stallion can be used as synonyms. NLP is expected to solve these problems; however, its adoption in information retrieval systems still appears immature.
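To make the scheme concrete, the sketch below implements the tf·idf weighting and a simple cosine-similarity search in Python; the toy corpus, the stop-word list and all identifiers are invented for illustration, and a real system would add stemming, more careful tokenization and a persistent inverted index. The cosine measure anticipates the Vector Space model discussed in the next subsection, and the last two queries reproduce the horse/stallion example: a query term that never occurs in the corpus vocabulary yields no results.

import math
from collections import Counter

# Illustrative stop-word list and corpus (hypothetical, not from the thesis).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "is", "are"}

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def term_frequencies(tokens):
    counts = Counter(tokens)
    max_count = max(counts.values())
    # tf_i(d) = freq(t_i, d) / max_j freq(t_j, d)
    return {t: c / max_count for t, c in counts.items()}

def build_index(documents):
    tokenized = {name: tokenize(text) for name, text in documents.items()}
    vocabulary = sorted({t for tokens in tokenized.values() for t in tokens})
    # idf_i = log(|D| / |{d in D : t_i occurs in d}|)
    idf = {t: math.log(len(documents) /
                       sum(1 for tokens in tokenized.values() if t in tokens))
           for t in vocabulary}
    vectors = {}
    for name, tokens in tokenized.items():
        tf = term_frequencies(tokens)
        vectors[name] = {t: tf.get(t, 0.0) * idf[t] for t in vocabulary}
    return vocabulary, idf, vectors

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

documents = {
    "d1": "the horse runs in the field",
    "d2": "a horse is kept in the stable",
    "d3": "the database stores metadata about documents",
}
vocabulary, idf, vectors = build_index(documents)

def search(query):
    tokens = tokenize(query)
    tf = term_frequencies(tokens) if tokens else {}
    query_vector = {t: tf.get(t, 0.0) * idf[t] for t in vocabulary}
    return {name: cosine(query_vector, vec) for name, vec in vectors.items()}

print(search("horse"))     # d1 and d2 score above zero, d3 scores zero
print(search("stallion"))  # every score is zero: the term is not in L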
Classification and Retrieval

In this section, classification and retrieval are described in parallel. Although they are quite different processes, the latter can be seen as a particular case of the former in which the category definition is given at runtime by the user query. It must be stressed that this section, like the preceding one, does not aim to be complete and exhaustive; instead, it aims to make clear some shortcomings of the retrieval process which are shared with classification and which can be improved by the adoption of semantics. As in the previous subsection, the tf·idf method and the Vector Space model [10] are adopted as reference implementations.

After the indexing process, each document in the knowledge base managed by an IR system has an associated set of features describing its content. In classification, these features are compared to a class definition (either predefined or learned through clustering) to evaluate whether or not documents belong to the class. In retrieval, instead, the same set is compared against a set of features specified by a user in the form of a query.

The retrieval (classification) process defines how the comparison is to be performed. In doing so, a similarity measure must be defined that quantitatively measures the distance between document descriptions and user queries or category definitions. The similarity Sim(d_i, d_j) defines the distance, in terms of features, between resources in a given representation space. Such a measure is usually normalized: resources having the same description in terms of the modeled features get a similarity score of 1, while completely dissimilar resources receive a similarity score of 0. Please note that a similarity score of 1 does not mean that the compared resources are exactly equal to each other. The similarity measure, in fact, works only on the resources' features, and two resources can have the same features without being equal. However, the underlying assumption is that, although diverse, resources with similar features are "about" the same theme from a human point of view. Therefore, the higher the similarity between two resources, the higher the probability that they have something in common.

The Vector Space model is one of the most widespread retrieval and classification models. It works on a vector space defined by the document features extracted during the indexing process. In the Vector Space model, the words belonging to the vocabulary L are considered as the basis of the vector space of documents d and queries q. Documents and queries are in fact expressed in terms of words t_i ∈ L and can therefore be represented in the same space (Figure 3.4). Representing documents and queries (or class definitions) in the same vector space makes it possible to evaluate the similarity between these resources in a quite straightforward manner, since the classical cosine similarity measure can be adopted. In the Vector Space model the similarity is, in other words, evaluated as the cosine of the hyper-angle between the
vector of features representing a given document and the vector representing a given query. Similarity between documents and classes can be evaluated in the same way.

Figure 3.4. The Vector Space Model.

Formally, the cosine similarity is defined as:

Sim(d_i, d_j) = \frac{\vec{d}_i \cdot \vec{d}_j}{\|\vec{d}_i\| \, \|\vec{d}_j\|}

where \vec{d}_i is the feature vector of d_i and \vec{d}_j is the feature vector of d_j. As demonstrated by the successful application of this model to many real-world case studies, the Vector Space model is quite an effective solution to the problem of classifying and retrieving resources. However, at least two shortcomings can be identified. First, the model works under the assumption that the terms in L form an orthogonal basis for the space of documents and queries; this assumption is clearly not true, since words usually appear in groups depending on the document (or query) type and domain. Secondly, the approach is strongly influenced by the features extracted from documents, and since these are in most cases simple, vocabulary-dependent, syntactic features, the approach itself also becomes syntactic and vocabulary dependent. As an example, suppose that a deficient vocabulary L_w contains the two words horse and stallion and suppose that they are not identified as synonyms (though they actually are, in the analyzed knowledge domain). If a document consists of the single term horse and another document of the single term stallion, they are recognized as completely different. In case a user specifies