Knowledge Organization and Discovery
Presented by Dave Clarke to the Global Knowledge Management Congress, Mumbai, June 23rd 2016
Introduction
Knowledge Management is a powerful tool for
improving the business results of any enterprise.
It achieves this through engaging the enterprise
in a concerted effort to optimise the use of
knowledge assets, which include recorded
information and methodologies as well as
human experience and know-how.
KM is both a strategic and a tactical endeavour.
It will fail without vision and leadership from
the top. It will only succeed when all parts of the
organisation collaborate to align people and
resources toward a common goal.
KM is also a technical endeavour. It requires a
systematic approach based on standards and best
practices, as well as the support of software
systems for managing knowledge organisation,
discovery, and sharing.
The technologies and the methodologies used to
organise, discover and share knowledge are the
subject of this talk. Academically this activity is
firmly grounded in library and information
science, but other disciplines are now playing an
increasingly important role, including computer
and data science, language engineering, linked
data, social media and user experience design.
The Origins and Evolution of Knowledge Organization
Knowledge Organization methods and practices
started as soon as people began collecting
knowledge in libraries. In the 3rd century BCE
the poet and scholar Callimachus of Cyrene
produced a 120-volume bibliographic catalogue
of the half-million works held in the Library of
Alexandria. The catalogue comprised indexes
and tables of information including the title,
author and subject, as well as brief biographies
and abstracts.
In the late 19th century CE the American
librarian and educator Melvil Dewey developed
a hierarchically organized classification system.
The Dewey Decimal Classification system was
widely adopted by libraries around the world
and is still in use today. In the 20th century CE
the Indian mathematician and librarian S. R.
Ranganathan devised a faceted classification
system (Colon Classification), supporting much
greater indexing specificity.
Library classification schemes were primarily
designed to place physical books on to physical
shelves. This posed an immediate knowledge
organization challenge. A physical book can’t be
in two places at the same time, but many books
span multiple subjects.
The problem was overcome by creating indexes.
These could locate a book under the multiple
subject headings that it is about, as well as by
title, author, and other parameters.
Simple card indexes grew in size and
complexity, taking the form of alphabetical and
hierarchical indexes of subjects, places and
people. Libraries standardized the terminology
they used by creating ‘controlled vocabularies’
and ‘authority files’, such as the Library of
Congress Subject Headings.
When information was stored digitally rather
than on paper, computers could search the full
text of books and documents at lightning speed.
For a while some people thought that they no
longer needed the knowledge organization tools
and methods developed by libraries.
Most enterprises came to realize, however, that
full text search has its own limitations. Search
can find specific words or phrases but it has no
understanding of which words and phrases are
significant; it doesn’t understand what a
document is about. Additionally, because one
word can refer to different things and one thing
can be described by different words, searching
full text is necessarily imprecise. For every
relevant document returned a user may have to
wade through hundreds or thousands of
irrelevant documents.
A more insidious problem is that some relevant
documents will never be seen because the
language used by the searcher doesn’t match the
language used by the author.
Knowledge Organization Systems
KOS are formally structured schemes that
describe collections of like-things such as
subjects, people and places. They disambiguate
words with multiple meanings and also store
variant terminology. Relationships assert facts
about how concepts and entities relate to one
another.
Numerous national and international standards
are available to help guide the design of KOS
schemes. Foremost among them are ISO 25964
and the W3C standards SKOS and OWL.
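
The ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the concept identifiers, the sample labels and the helper names `lookup` and `expand_query` are invented for this example and do not come from any standard or product. In a SKOS-based scheme the three fields would roughly correspond to skos:prefLabel, skos:altLabel and skos:broader.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One entry in a toy KOS (all identifiers here are illustrative)."""
    concept_id: str
    pref_label: str                                  # preferred term
    alt_labels: list = field(default_factory=list)   # variant terminology
    broader: list = field(default_factory=list)      # ids of parent concepts


# A hypothetical four-concept scheme: one ambiguous word, two meanings.
KOS = {
    "c1": Concept("c1", "Mercury (planet)", ["Mercury"], ["c3"]),
    "c2": Concept("c2", "Mercury (element)", ["Mercury", "quicksilver", "Hg"], ["c4"]),
    "c3": Concept("c3", "Planets"),
    "c4": Concept("c4", "Chemical elements"),
}


def lookup(term):
    """Return every concept whose preferred or variant label matches the term."""
    t = term.casefold()
    return [c for c in KOS.values()
            if t == c.pref_label.casefold()
            or t in (a.casefold() for a in c.alt_labels)]


def expand_query(concept_id):
    """Return all labels of a concept so a search can match any variant."""
    c = KOS[concept_id]
    return [c.pref_label] + c.alt_labels


# An ambiguous word maps to two distinct concepts the user can choose between.
print([c.pref_label for c in lookup("mercury")])
# Once a concept is chosen, its variants widen the search beyond one word.
print(expand_query("c2"))
```

Searching on the chosen concept rather than on the typed word is what lets a system find documents whose authors used different language from the searcher.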
Linked Data
One of the most exciting innovations in the
management of knowledge is the prolific growth
of Linked Data. Linked Data allows diverse
datasets to be recombined in novel ways. This
can be kept behind the firewall as Linked
Enterprise Data or exposed to the world as
Linked Open Data. Linked Open Data provides
a framework for the entire planet to share
knowledge. When different user communities
and perspectives come together and collaborate,
our collective knowledge is enriched in the
process. Any organization can start tapping into
a wealth of structured knowledge currently
available as Linked Open Data. Organizations
that fail to embrace the Linked Data opportunity
may struggle to keep up with those who do.
Semantic Annotation
Words and phrases found in content may be
individually annotated by linking them to the
most specific concepts and names they discuss
or represent. Larger sections of content
including complete documents can be classified
to the broader categories they are about. The
indexing process can also extract concepts and
names from content and submit them as
candidates for inclusion in the ontology. Natural
Language Processing (NLP) techniques analyze
the text and metadata of content. The indexing
process leverages the semantic relationships of
the ontology to determine the context of words
and phrases within the text and then match them
to disambiguated concepts in the ontology.
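
As a rough sketch of this matching step, the Python below links a word to a concept by counting how many of each candidate concept's related terms also occur in the text. It is a deliberately simplified stand-in for real NLP: the ontology contents, the identifiers and the `annotate` function are assumptions made for illustration, and only single-word labels are handled.

```python
import re

# Toy ontology: each concept lists its labels plus a few related terms,
# standing in for the semantic relationships of a real ontology.
ONTOLOGY = {
    "place/london-england": {
        "labels": {"london"},
        "context": {"england", "uk", "westminster"},
    },
    "place/london-ontario": {
        "labels": {"london"},
        "context": {"ontario", "canada"},
    },
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def annotate(text):
    """Link each recognised word to the concept whose context best fits the text."""
    tokens = tokenize(text)
    token_set = set(tokens)
    annotations = []
    for position, token in enumerate(tokens):
        candidates = [cid for cid, c in ONTOLOGY.items() if token in c["labels"]]
        if not candidates:
            continue
        # Score each candidate by how many of its related terms appear in the text.
        scores = {cid: len(ONTOLOGY[cid]["context"] & token_set) for cid in candidates}
        best = max(candidates, key=lambda cid: scores[cid])
        # Leave the word unresolved if the context does not separate the candidates.
        tied = list(scores.values()).count(scores[best]) > 1
        ambiguous = len(candidates) > 1 and tied
        annotations.append((position, token, None if ambiguous else best))
    return annotations


print(annotate("The conference in London attracted delegates from across England."))
```

Returning `None` when the candidates tie, rather than guessing, reflects one possible design choice: an unresolved term can be routed to a human indexer instead of being silently mislabelled.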
Semantic Enrichment
Concepts and names within an enterprise
ontology may be mapped to equivalents in
external ontologies. Alternatively, external
ontologies may be adopted and used directly as
reference authorities. When internal content is
indexed or mapped to external resources then
additional information contained in those
resources can be retrieved. For example, text
analysis may identify that a document mentions
‘London’ and that this refers to London,
England as distinct from London, Ontario. Once
the named entity has been unambiguously
identified it can be mapped to equivalent entities
in resources like DBpedia and GeoNames.
These resources can then deliver additional
information such as latitude and longitude
coordinates, population statistics, maps, images
and data on industry, commerce and
government.
Human and Machine Indexing
Indexing content can be a fully manual process,
a fully automated process or, most often, a
mixture of the two. Automation involves the use
of Named Entity Recognition (NER) and
Natural Language Processing (NLP) systems. It
also uses the semantic relationships found
within Knowledge Organization Systems plus
general-purpose lexicons and grammatical
parsers, along with custom-built indexing rules
and machine learning processes. Modular NLP
components can be assembled to create a finely
tuned semantic indexing pipeline.
During the prototyping phase tools may be
employed to create reference training sets using
consensus-based human indexing. NLP and
NER processes may then be tested against the
training sets and optimized. After the
prototyping phase, corpora, ontologies and
NLP/NER processing resources may be
compiled into a semantic annotation application.
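
One way to picture the testing step is sketched below in Python. The talk does not say how consensus is reached or which measures are used, so the majority-vote rule and the precision, recall and F1 scores here are assumptions chosen for illustration; the function names and sample data are likewise hypothetical.

```python
from collections import Counter


def consensus(indexer_choices, min_votes=2):
    """Keep the concepts that at least `min_votes` human indexers assigned."""
    votes = Counter(c for choices in indexer_choices for c in choices)
    return {c for c, n in votes.items() if n >= min_votes}


def score_indexing(reference, predicted):
    """Compare machine-assigned concepts with a human reference set.

    Both arguments map a document id to the set of concept ids assigned to it.
    Returns (precision, recall, f1) pooled over all reference documents.
    """
    true_pos = false_pos = false_neg = 0
    for doc_id, expected in reference.items():
        found = predicted.get(doc_id, set())
        true_pos += len(expected & found)
        false_pos += len(found - expected)
        false_neg += len(expected - found)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Three hypothetical indexers tag the same document; two votes are required.
doc1_reference = consensus([{"love", "venus"}, {"love", "venus", "cupid"}, {"love"}])

reference = {"doc1": doc1_reference, "doc2": {"war"}}
predicted = {"doc1": {"love", "cupid"}, "doc2": {"war"}}  # machine output

precision, recall, f1 = score_indexing(reference, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Re-running the same comparison after each change to the rules or models gives a simple, repeatable signal of whether the pipeline is actually improving.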
Knowledge Discovery
We first discover knowledge about the world
directly through our senses: touch, smell, taste,
sight and sound. We then discover
knowledge vicariously through books, media
and dialogue with other people.
In the digital age most of the vicarious
information we receive is delivered via
computers and mobile devices. Screens have
become the medium between people and
knowledge. Knowledge discovery is the
endgame of knowledge organization.
Modes of Human Interaction with Information Systems
There are three fundamentally different ways
that humans interact with information systems:
(i) Search, which starts with a user’s expression
of their question (usually one or two keywords)
and then follows iterative refinement; (ii)
Browse, which starts with the system presenting
organized lists or graphs of related things and
then follows the user’s chosen pathway; and (iii)
Discovery, which happens when the user’s
search or browse experience is interrupted upon
the surfacing of relevant concepts or content that
were not present in the user’s initial query.
Examples of Knowledge Discovery
The presentation slides accompanying this talk
provide five examples illustrating how
Knowledge Organization Systems support
knowledge discovery.
In example 1 the user searched on “love”. The
Knowledge Organization System was accessed
to identify artworks and specific figurative
details that are about the concept of love
regardless of what words may actually exist in
the content.
In example 2 the user pans and zooms around
images and the Knowledge Organization System
responds by dynamically updating a discovery
panel to reveal the concepts related to what is in
view.
In example 3 the user browses alphabetical and
hierarchical lists of concepts and visual features
and the system responds by opening the image
and panning and zooming to the specific visual
details.
In example 4 the Knowledge Organization
System works behind the scenes to facilitate the
discovery of conceptually related content. Using
Linked Open Data, the KOS opens further gateways
to discover related external content from sources
such as DBpedia.
Finally, example 5 illustrates how Knowledge
Organization Systems and Linked Open Data
ontologies allow enterprises to ‘search outside
the box’ of their own content, enabling questions
to be answered where the relevant data does not
exist within the enterprise’s internal content.
The example uses NLP technology to identify
people within news articles. The people are then
semantically indexed using named entities in
external biographical ontologies. The external
biographical data can then be queried along with
the internal news content.
This technique enables powerful queries to be
performed, such as ‘find news articles about
politicians born in England who are under the
age of 50 and hold cabinet positions’. The
internal content only has to mention the names
of people within the text. The rest of the query
can be answered by referencing knowledge
contained in external KOS ontologies.
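
The query described above can be sketched as a join between two small in-memory datasets. Everything in this Python example is hypothetical: the article titles, the person identifiers and the biographical records are fictional placeholders, and a plain dictionary stands in for an external biographical ontology that would normally be queried over Linked Open Data.

```python
from datetime import date

# Internal content: articles already annotated with the people they mention.
ARTICLES = [
    {"title": "Budget debate continues", "people": {"person/a"}},
    {"title": "Local team wins cup", "people": {"person/b"}},
    {"title": "New transport plan announced", "people": {"person/a", "person/c"}},
]

# Stand-in for an external biographical ontology (fictional records).
BIOGRAPHIES = {
    "person/a": {"born": date(1980, 5, 1), "birth_country": "England",
                 "roles": {"politician", "cabinet minister"}},
    "person/b": {"born": date(1990, 2, 10), "birth_country": "England",
                 "roles": {"footballer"}},
    "person/c": {"born": date(1950, 9, 30), "birth_country": "Scotland",
                 "roles": {"politician", "cabinet minister"}},
}


def age_on(born, today):
    """Whole years between a birth date and a reference date."""
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)


def matches(bio, today):
    """Politician born in England, under 50, holding a cabinet position."""
    return (bio["birth_country"] == "England"
            and age_on(bio["born"], today) < 50
            and {"politician", "cabinet minister"} <= bio["roles"])


def find_articles(today):
    """Answer the query using external facts, then select internal articles."""
    wanted = {pid for pid, bio in BIOGRAPHIES.items() if matches(bio, today)}
    return [a["title"] for a in ARTICLES if a["people"] & wanted]


print(find_articles(date(2016, 6, 23)))
```

Note that the articles themselves carry no dates of birth, birthplaces or offices held; the only link between the two datasets is the shared person identifier produced by semantic indexing.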
Conclusions
KM can help enterprises to create, preserve and
disseminate actionable knowledge. At the level
of knowledge organisation and discovery the
key challenge is to help people to see both the
forest and the trees: to optimise search and find
the needle in the haystack, while also being able
to contextualise how resources relate to one
another. Knowledge organisation addresses the
challenges of both retrieval and relevance.
As digital data continues to grow exponentially
the need for knowledge organization tools only
increases. We need systems that can index
documents by all the significant concepts they
discuss and then summarize what each
document is about. Controlled vocabularies and
authority files are key enablers in this
endeavour.