1. Jabin White
Director of Strategic Content
Wolters Kluwer Health – Professional & Education
SSP 31st Annual Meeting
Baltimore, MD | May 28, 2009
2. What is metadata, and why should
publishers care?
Impact on publishers – how metadata
impacts processes
Case Studies – This isn’t your Daddy’s
publishing business
Final Thoughts, Recommendations
3. Reading most definitions of metadata and
related standards is like trying to resolve
disputes with my kids
As Ed said, metadata is “data about data”
• But what does that mean?
Its use may be increasing, but metadata is
NOT new
4. In the move from print publishing to digital,
metadata is a powerful tool to help publishers
get content in the right place, in the right
format, and known to the right systems and
people, at the right time
Print books were easy
• Everyone knew what they were
• You could really only use them one way
• They had a beginning, an end, a physical presence,
and a set price (mostly)
5. Today, computers are often communicating
with one another as much as they are with
users (people)
Metadata becomes critical in:
• B2B relationships
• Enhancing B2C relationships
• B2-_________ relationships
The quality of the metadata gives
publishers a more powerful voice in what
happens to their content
6. For example:
• A digital asset (an image)
• What file format is it?
• How big is the image?
• Who took the picture?
• Who owns the picture?
• Can you use it on your web site? If you do, what credit
do you have to give to the owner?
• What date was it created?
• Is it part of a collection?
• Is it related to another piece of content?
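The questions on this slide map naturally onto a structured record. The following is a minimal sketch, assuming a hypothetical `ImageMetadata` record; the field names, the sample values, and the fallback credit wording are invented for illustration and do not come from any particular standard.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ImageMetadata:
    """One image asset; every field answers one question from the slide."""
    file_format: str                 # What file format is it?
    width_px: int                    # How big is the image?
    height_px: int
    photographer: str                # Who took the picture?
    rights_holder: str               # Who owns the picture?
    web_use_allowed: bool            # Can you use it on your web site?
    credit_line: Optional[str]       # What credit must be given?
    created: date                    # What date was it created?
    collection: Optional[str] = None                 # Is it part of a collection?
    related_content_ids: List[str] = field(default_factory=list)  # Related content?


def web_credit(meta: ImageMetadata) -> Optional[str]:
    """Return the credit to show on a web page, or None if web use is not allowed."""
    if not meta.web_use_allowed:
        return None
    # Hypothetical fallback wording when no explicit credit line is recorded.
    return meta.credit_line or f"Image: {meta.rights_holder}"


photo = ImageMetadata(
    file_format="JPEG", width_px=1600, height_px=1200,
    photographer="A. Photographer", rights_holder="Example Publisher",
    web_use_allowed=True, credit_line=None, created=date(2009, 5, 1),
    collection="Anatomy plates", related_content_ids=["chapter-12"],
)
print(web_credit(photo))
```

Once the answers live in fields rather than in someone's memory, a downstream system can decide on its own whether and how to display the image.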
7. If a publisher’s goal is to disseminate
content to the widest possible audience,
metadata is critical
8. Again, in books you had one use model
Metadata allows publishers to have diverse
relationships with content consumers and other
information providers
• Customers (duh)
• Aggregators
• The Open Web (not Google, but other search engines)
But don’t try to “game” the search engines with adult keywords;
that’s just wrong
There have been lawsuits over use of meta keywords, including
Playboy suing two adult web sites
• Technology partners/developers
• Systems wherein content is a “value add”
• Multiple output formats
9. HTML Metadata
• <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
• <meta name="verify-v1" content="kBoFGUuwppiWVWGx4Ypzkw1Cs1GgMYEMMbfNr7FY65w=" />
(slide callout: For people)
• <meta name="description" content="International publisher of professional health information for physicians, nurses, specialized clinicians & students. Medical & nursing charts, journals, and pda software.">
(slide callout: For search engines)
• <meta name="keywords" content="springhouse, medical book, nursing journal, medical pda software, lippincott medical reference, lww, lippincott, lww com, medical publisher">
• <link rel="stylesheet" href="/css/style.css" type="text/css">
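To see how another system might consume tags like these, here is a minimal sketch using Python's standard `html.parser` module. The `MetaCollector` class, the `extract_meta` helper, and the shortened sample markup are assumptions made for illustration; real crawlers are considerably more involved.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> tags into two dictionaries: by name and by http-equiv."""

    def __init__(self):
        super().__init__()
        self.named = {}
        self.http_equiv = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)                      # attribute names arrive lowercased
        content = attr.get("content") or ""
        if attr.get("name"):
            self.named[attr["name"].lower()] = content
        elif attr.get("http-equiv"):
            self.http_equiv[attr["http-equiv"].lower()] = content


# Shortened, illustrative markup modelled on the slide.
SAMPLE_HEAD = """
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="description" content="International publisher of professional health information.">
<meta name="keywords" content="medical book, nursing journal, medical publisher">
</head>
"""


def extract_meta(html):
    """Parse the markup and return the populated collector."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser


meta = extract_meta(SAMPLE_HEAD)
keywords = [k.strip() for k in meta.named["keywords"].split(",")]
print(meta.named["description"])
print(keywords)
```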
10. Classifying Metadata
• ISBN (I told you this wasn’t new)
• Dewey Decimal System
• Books in Print/CIP/Library of Congress data
• MARC records
• DOI (Digital Object Identifier)
Descriptive Metadata (sorry, my examples are from STM)
• ICD-9 and ICD-10 Codes
• MeSH
• SNOMED-CT
• NANDA, NIC, NOC for Nursing
• NDC, HCPCS for drugs
12. Using controlled vocabularies, extra power
can be added to content via semantic
tagging to drive:
• More precise searching
• Contextually-based connections
• Lowering of “two terms meaning the same thing”
syndrome (hypertension vs. high blood pressure;
heart attack vs. myocardial infarction)
• Filling in of content gaps
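The "two terms meaning the same thing" problem can be sketched in a few lines. This is a toy illustration only: the `PREFERRED_TERMS` table is a hand-made stand-in for a real controlled vocabulary, and the tagging uses naive substring matching, which a production indexer would not rely on.

```python
# Hypothetical mini-vocabulary: every variant maps to one preferred term.
PREFERRED_TERMS = {
    "high blood pressure": "hypertension",
    "hypertension": "hypertension",
    "heart attack": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
}


def normalize_term(term):
    """Map a user's wording to the preferred term, or None if unknown."""
    key = " ".join(term.lower().split())
    return PREFERRED_TERMS.get(key)


def tag_text(text):
    """Return the set of preferred terms whose variants appear in the text."""
    lowered = " ".join(text.lower().split())
    return {preferred for variant, preferred in PREFERRED_TERMS.items()
            if variant in lowered}


def search(documents, query):
    """Find documents about the queried concept, whatever wording they use."""
    concept = normalize_term(query)
    if concept is None:
        return []
    return [doc_id for doc_id, text in documents.items()
            if concept in tag_text(text)]


documents = {
    "doc-1": "Managing high blood pressure in older adults",
    "doc-2": "Hypertension guidelines for nurses",
    "doc-3": "Recovery after a heart attack",
}
print(search(documents, "hypertension"))
```

A search for either wording returns both of the first two documents, because the match happens on the concept rather than on the string.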
14. Impact on publishers depends on answers
to questions in previous section
• i.e., what am I going to get in return for investing
in metadata, and is it worth it?
• More and more, this is not an “if” proposition, it’s
“how much”
Publishers who buy in have two basic
choices on approach:
15. Requires deeper commitment, but has bigger
potential upside
• Positive impact on product creation and development
Requires thinking about tools, workflows, and
enterprise-level systems to allow for creation and
maintenance of metadata
Combination of good metadata in the workflow and
creativity in product development team can pay big
benefits
Allows participation of authors (or subject matter
experts in lieu of) at the beginning of the workflow
16. Requires lesser commitment, but potentially
fewer rewards
Can be done with zero impact on current
systems
Has benefit of content being in “final form”
(whatever that means anymore) when
intelligence is added in metadata
Can keep SMEs as a separate offshoot of the
workflow – easily outsourced
Can replace all of the above with software
solutions (Darrell and Chris will talk about
that)
17. Chris, Darrell and I do NOT disagree
There are justifications that can be made
for tagging or entity extraction approaches
(or both)
Just as there is no “one size fits all”
metadata, there is no ONE solution
But if you must pick one, I’m right
18. Active vs. Passive Metadata
• Active metadata
Publisher intentionally associates markup with certain
pieces of content
Often using controlled vocabulary
Includes semantic indexing
Can also be machine-based, using scripts, etc.
• Passive metadata
Metadata created based on use of content
Inheritance of properties from parent objects
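Both kinds of passive metadata can be illustrated with the standard library. This is a rough sketch under stated assumptions: the book, chapter, and figure records and the access log are invented, and `effective_metadata` is a hypothetical helper name.

```python
from collections import ChainMap, Counter

# Hypothetical parent and child objects; only the most specific values are stored.
book_meta = {"publisher": "Example Press", "subject": "Nursing", "language": "en"}
chapter_meta = {"title": "Wound Care", "subject": "Nursing; Dermatology"}
figure_meta = {"title": "Figure 3.2", "file_format": "TIFF"}


def effective_metadata(*levels):
    """Merge metadata levels, most specific first; missing keys fall back to parents."""
    return dict(ChainMap(*levels))


figure_view = effective_metadata(figure_meta, chapter_meta, book_meta)
print(figure_view["publisher"])   # inherited from the book
print(figure_view["subject"])     # inherited from the chapter, which overrides the book

# Metadata created from use: count how often each object was accessed.
access_log = ["fig-3.2", "ch-3", "fig-3.2"]
usage = Counter(access_log)
print(usage.most_common(1))
```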
19. The use of active metadata usually means an
impact on support tools
• Re-think authoring tools to allow for capture of metadata by
authors
This can be outsourced to external SMEs – help is available
• Re-think content management to allow for
preservation/management of metadata
How deep you go depends on how big the payoff
• Good semantic indexing can drive new features and
functionality, but must be used appropriately
If you decide to add active metadata, a controlled
vocabulary just became your new best friend
20. Ontology – an explicit specification of a
conceptualization
• In English: a controlled vocabulary used to
describe a group of topics
Taxonomy – same as ontology, but with
hierarchy implied
Caveat – These two terms are so misused,
their definitions no longer matter (think
Content Management circa 2000)
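Taking the slide's plain-English definitions at face value, the hierarchy is what separates a taxonomy from a flat term list. Here is a minimal sketch, assuming a hand-made `BROADER` table; the terms and their arrangement are illustrative and are not taken from MeSH or any other published vocabulary.

```python
# Each term points to its broader (parent) term; None marks the top of the tree.
BROADER = {
    "myocardial infarction": "heart disease",
    "heart disease": "cardiovascular disease",
    "hypertension": "cardiovascular disease",
    "cardiovascular disease": None,
}


def ancestors(term):
    """Walk up the hierarchy and list every broader term, nearest first."""
    chain = []
    parent = BROADER.get(term)
    while parent is not None:
        chain.append(parent)
        parent = BROADER.get(parent)
    return chain


def narrower(term):
    """List every term that sits anywhere below the given term."""
    return sorted(t for t in BROADER if term in ancestors(t))


print(ancestors("myocardial infarction"))
print(narrower("cardiovascular disease"))
```

With the hierarchy in place, content tagged with a narrow term can also be found by someone browsing or searching at a broader level.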
21. PRISM (Publishing Requirements for Industry
Standard Metadata) – an XML metadata vocabulary for
handling content – started out in magazines and
journals, but has added other types
Dublin Core – named after a 1995 workshop in Dublin,
Ohio, it is, very simply, a set of 15 agreed-upon
metadata elements used to describe objects
• PRISM uses Dublin Core elements and then makes them specific
to publishing
RDF (Resource Description Framework): a framework,
commonly written in XML, that lets you richly describe
relationships between data on web pages. Explain
triplets
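An RDF statement is a triple: a subject, a predicate, and an object. The sketch below holds a few triples as plain Python tuples and prints them in a simplified N-Triples form. The `http://example.org/` book URI and the literal values are invented, the predicates use the Dublin Core element namespace, and the serializer skips the literal escaping a real one would need.

```python
DC = "http://purl.org/dc/elements/1.1/"      # Dublin Core element namespace
BOOK = "http://example.org/book/123"         # hypothetical resource URI

# Each statement is (subject, predicate, object).
triples = [
    (BOOK, DC + "title", "Example Drug Handbook"),
    (BOOK, DC + "creator", "Example Author"),
    (BOOK, DC + "publisher", "Example Press"),
    (BOOK, DC + "date", "2009"),
]


def to_ntriples(statements):
    """Render triples with literal objects, one statement per line (no escaping)."""
    lines = []
    for subject, predicate, obj in statements:
        lines.append(f'<{subject}> <{predicate}> "{obj}" .')
    return "\n".join(lines)


print(to_ntriples(triples))
```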
22. Semantic Web – A web of data. Envisioned by Tim
Berners-Lee, it will be a web driven by data that “talks”
to other data
• My kids will work on this
FOAF Project (Friend of a Friend): Uses RDF to
describe people and their preferences to the web, so
you can find people with similar interests; all about
social networking
SPARQL (SPARQL Protocol and RDF Query Language)
– once you have used RDF to describe resources and
their connection points, you use SPARQL to ask
questions about those connections and find stuff
OWL (Web Ontology Language) – extends ability of
RDF and XML Schemas to describe information
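To give a feel for what "asking questions about connections" means, the sketch below matches triple patterns against a small FOAF-style data set in plain Python. It is not SPARQL and not an RDF library: the `match` and `query` helpers, the `ex:` identifiers, and the people are assumptions for illustration, and a real project would use an RDF store with a proper query engine.

```python
# Toy data in prefixed form; foaf:name and foaf:interest are FOAF properties.
triples = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:carol", "foaf:name", "Carol"),
    ("ex:alice", "foaf:interest", "ex:metadata"),
    ("ex:bob", "foaf:interest", "ex:metadata"),
    ("ex:carol", "foaf:interest", "ex:typography"),
]


def match(pattern, data):
    """Yield variable bindings for one pattern; terms starting with '?' are variables."""
    for triple in data:
        binding = {}
        for want, have in zip(pattern, triple):
            if want.startswith("?"):
                if binding.get(want, have) != have:
                    break                      # same variable bound to two values
                binding[want] = have
            elif want != have:
                break                          # constant term does not match
        else:
            yield binding


def query(patterns, data):
    """Join patterns by substituting earlier bindings into later patterns."""
    results = [{}]
    for pattern in patterns:
        next_results = []
        for bound in results:
            concrete = tuple(bound.get(term, term) for term in pattern)
            for found in match(concrete, data):
                next_results.append({**bound, **found})
        results = next_results
    return results


# Roughly: SELECT ?name WHERE { ?p foaf:interest ex:metadata . ?p foaf:name ?name }
rows = query([("?p", "foaf:interest", "ex:metadata"),
              ("?p", "foaf:name", "?name")], triples)
names = sorted(row["?name"] for row in rows)
print(names)
```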
23. Drug Reference Product
Perfect, structured information that is a great
example of metadata becoming just as
important as content
Examples of things that were stored in
metadata:
• Codes, codes, and more codes
• Drug interaction information
• Classifications (this one was actually redundant)
• Formulary information
• FDA approval date (could also be redundant)
24. Four editors spent as much time working
on metadata as they did on content itself
All work on import/export from DB was
done by:
• Acting on metadata
• Keeping metadata at top of priority list on output
• “Output all anticoagulant drugs that were
approved before 1982”
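A request like the one quoted above becomes a simple filter once class and approval date are stored as metadata. This is a minimal sketch with invented records: the drug names, classes, and dates are placeholders rather than real approval data, and `DrugRecord` and `select_drugs` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DrugRecord:
    name: str
    drug_class: str        # classification metadata
    fda_approval: date     # approval-date metadata
    monograph: str         # the content itself


def select_drugs(records, drug_class, approved_before_year):
    """Return names of drugs in a class approved before the given year, sorted."""
    return sorted(
        r.name for r in records
        if r.drug_class == drug_class and r.fda_approval.year < approved_before_year
    )


# Placeholder records; none of these values describe real drugs.
records = [
    DrugRecord("Drug A", "anticoagulant", date(1975, 6, 1), "..."),
    DrugRecord("Drug B", "anticoagulant", date(1990, 2, 15), "..."),
    DrugRecord("Drug C", "antihypertensive", date(1970, 9, 30), "..."),
    DrugRecord("Drug D", "anticoagulant", date(1981, 12, 31), "..."),
]
print(select_drugs(records, "anticoagulant", 1982))
```

The filter never opens the monograph text, which is the point: the output decision is made entirely on metadata.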
25. Medical content (5 years ago I would have
said “book”)
Thousands of topics, sometimes printed,
always updated, sent to web, handhelds
How/when they are updated, whether or
not they are printed, and whether or not
they get extracted is all driven by ….
Metadata!
26. Extracts are all made by acting on
metadata
• What is the subject area of the topic? (this can be
a MANY to ONE relationship)
• When was the topic last updated?
• Who was the author of the last update?
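Because one topic can carry several subject areas, an extract is easiest to build from an index keyed by subject. The following is a rough sketch; the topic records, editor names, and the `subject_index` and `build_extract` helpers are all invented for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical topics; each may belong to several subject areas.
topics = [
    {"id": "t1", "subjects": ["cardiology", "nursing"],
     "updated": date(2009, 3, 1), "updated_by": "Editor A"},
    {"id": "t2", "subjects": ["cardiology"],
     "updated": date(2007, 11, 20), "updated_by": "Editor B"},
    {"id": "t3", "subjects": ["oncology", "nursing"],
     "updated": date(2009, 4, 15), "updated_by": "Editor A"},
]


def subject_index(items):
    """Map each subject area to the ids of the topics tagged with it."""
    index = defaultdict(list)
    for topic in items:
        for subject in topic["subjects"]:
            index[subject].append(topic["id"])
    return dict(index)


def build_extract(items, subject, updated_since):
    """Select topic ids for one subject that were updated on or after a date."""
    return [t["id"] for t in items
            if subject in t["subjects"] and t["updated"] >= updated_since]


print(subject_index(topics))
print(build_extract(topics, "nursing", date(2009, 1, 1)))
```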
35. Have a metadata strategy
• Business case should support investment in metadata
• Be careful, and stay alert for mission creep – this stuff
can get out of control very easily
Know your organization
• Is it a change-tolerant organization? “All in” vs.
measured, incremental approach should be
considered
• Show me someone who says they have the correct
universal approach to metadata, and I’ll show you a
liar
36. A little bit of metadata understanding by
product development people can go a long
way
If a content set can benefit from metadata
in the creation of new products, that could
justify investment in metadata strategy and
tools within the workflow