NoSQL Technologies from an STM Publishing Perspective (NoSQL Now 2011)
1. NoSQL technologies from an STM publishing perspective
Bradley P. Allen, Elsevier Labs
Presentation at NoSQL Now 2011
San Jose, CA, USA
2011-08-25
2. Peak physical media: is it here?
• “Music Sales”, New York Times, 1 August 2009.
http://www.nytimes.com/imagepages/2009/08/01/opinion/01blow.ready.html
• “Initial Circs per student”, William Denton, 31 January 2011.
http://www.miskatonic.org/2011/01/31/initial-circs-student
• “Rise of e-book Readers to Result in Decline of Book Publishing Business”, Steven Mather, iSuppli, 28 April 2011.
http://www.isuppli.com/Home-and-Consumer-Electronics/News/Pages/Rise-of-e-book-Readers-to-Result-in-Decline-of-Book-Publishing-Business.aspx
3. In any case, the challenge to STM publishers is clear
• Print revenue is softening
• Online channels are exploding
– Changing the way customers create and consume
our content
– Leading to new requirements and market
opportunities for online products
4. Additional challenges in STM publishing
• Academic context and tradition inhibit
business model innovation
• Technology and business are traditionally
separate concerns
• Acquisitions create content and data silos
• Global market drives lowest common
denominator technology choices
5. A simple model of the evolution of STM publishing
• Print era: 1600s – 1980
– Packaged as books and journals
– Physically distributed
– Access and discovery through libraries
• Digital Library era: 1980 – 2010s
– Packaged as books and journals
– Digitally distributed
– Access and discovery through search engines
• Platform-as-a-service era: 2010s
– Packaged as apps
– Digitally distributed
– Access and discovery through social networks
6. STM publishing use cases in transition
• Use case: A new medical term relevant to an emerging healthcare issue (e.g. a new type of avian flu virus) needs to be incorporated into a search index immediately
– Digital Library era: Organizational governance issues about how taxonomies are to be updated, coupled with manually intensive workflows and ad hoc approaches to content tagging, inhibit rapid response
– Platform-as-a-service era: A single, automated and standardized taxonomy management and content enhancement workflow allows rapid and timely update of search applications
• Use case: Application developers want to mash up epidemiological data with medical journal articles to create a topic-specific Web resource
– Digital Library era: Data silos without easy means of programmatic access by developers, coupled with governance and business model questions, inhibit data reuse
– Platform-as-a-service era: Content API and single-point-of-access repository allow data and content to be accessed, discovered and reused across multiple applications
• Use case: Digital library developers want to stage content into a single repository for unified search index generation
– Digital Library era: Duplication of core content leads to synchronization and quality control issues
– Platform-as-a-service era: Consolidation of duplicate repositories into a single point of truth across all content, accessible and discoverable through a Content API, eliminates the need for duplication and synchronization
• Use case: Third party solutions providers want to integrate content (e.g. tagged medical journal articles, medical taxonomies) into point-of-care solutions
– Digital Library era: No standards, no APIs for point-of-care content integration across all content and data formats
– Platform-as-a-service era: Standards and APIs that scale across multiple partners, for all content types, for all delivery
• Use case: Publishers want to deliver their content to tablets and e-readers in delivery formats that take advantage of the displays and interaction modalities on those devices
– Digital Library era: No clear standard or approach for targeting emerging eReader and tablet devices; multiple and divergent approaches lead to siloed solutions and duplication of effort
– Platform-as-a-service era: Web and industry standards for eReader and tablet devices supported as part of standard automated processing into delivery channel-specific formats, regularly updated and exposed through a Content API
• Use case: Journal publisher wants to integrate content enhancements across multiple subject matter areas to add value to products leveraging Article of the Future technology
– Digital Library era: No single point of access to content enhancements, no standards for content enhancement suppliers and partners to deliver enhancements for integration
– Platform-as-a-service era: Easy access to multiple opportunities for content enhancements embedded in standard next-generation article formats and provided using standard content enhancement formats
7. Facets of STM publishing processes
• Process Type: Acquisition, Transformation, Enhancement, Composition, Access and discovery, Delivery
• Entity: author, supplier, editor, Web site, reviewer, typesetter, user, automated process, designer, subject matter expert, developer, search engine, content repository, entity registry
• Activity: submitting, crawling, syndicating, formatting, mapping, cleansing, indexing, querying, updating, storing, annotating, subject tagging, classification, entity recognition, entity extraction, fact extraction, clustering, aggregating, ordering, summarizing, filtering, analysis, rendering, design, publishing, accessing, retrieving, deleting
• Content Type: article, book, media object, entity record, taxonomy, ontology, user-generated content, product catalog, e-book, mobile app, mobile-enhanced Web site, API
8. Emerging content requirements
• Broad range of content types
– Must treat as first-class objects video, audio, images, datasets, metadata and knowledge organization systems in addition to articles and books
• Standards-based
– Web-standard formats to support ease of integration and interoperability
• Fine-grained
– Must be decomposable into and addressable in fragments smaller than the unit of publication; e.g., down to the level of specific words, phrases, images, table cells in articles or book chapters, key frames and segments in videos
• Discoverable
– Must be easily located across all levels of granularity
• Accessible
– Must be easily accessed through content creation, retrieval, update and deletion (CRUD) services
• Flexible
– New content types and associated schemas must be easily added through configuration
• Reusable
– It must be efficient for product developers to aggregate and compose content fragments into new products
• Modifiable
– Support the enhancement and correction of content at any time following creation
• Broad range of delivery formats
– Content standards and services must support fulfillment, delivery and presentation across desktop, notebook, tablet and mobile computing devices
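The fine-grained requirement can be illustrated with a minimal sketch, assuming an article is held as a nested document and fragments are addressed by a slash-separated path. The document shape, the field names and the `resolve` helper are hypothetical, chosen only for illustration; they are not taken from any Elsevier system.

```python
from typing import Any

# Hypothetical article document; every field name here is an assumption.
article = {
    "type": "article",
    "title": "Avian flu surveillance",
    "sections": [
        {
            "heading": "Methods",
            "paragraphs": [
                "Samples were collected weekly.",
                "Assays were run in duplicate.",
            ],
            "tables": [{"rows": [["site", "cases"], ["A", "12"]]}],
        }
    ],
}


def resolve(doc: Any, path: str) -> Any:
    """Walk a slash-separated path; steps index lists by position, dicts by key."""
    node = doc
    for step in path.split("/"):
        node = node[int(step)] if isinstance(node, list) else node[step]
    return node


# Address a single paragraph and a single table cell inside the article.
print(resolve(article, "sections/0/paragraphs/1"))
print(resolve(article, "sections/0/tables/0/rows/1/1"))
```

The same path could serve as a stable identifier for annotations or reuse, which is one way to read the "addressable in fragments" requirement.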
13. Why NoSQL is important to STM publishing
• NoSQL emphasizes design choices that focus on
delivering robust, scalable Web applications
– Document-centric
– Schemaless
– Support for analytics
– Read/write at Web scale
– Move scale-out from development to operations
• As we shift to the platform-as-a-service era,
these features become an important part of the
STM publishing technology stack
14. How NoSQL addresses STM publishing’s needs
• Schemaless, document-centric stores
– Ease repository extension to accommodate expanding range of new, finer-grained content types
– Fit HTML5/JS/CSS content stack providing web-based alternatives to native apps
– Expedite application stack refresh in support of authoring and editorial workflow
portals and tools
• Support for analytics eases innovation in scientometrics
• Read/write at Web scale accommodates solutions incorporating content
at more dynamic, fine-grained scale
– Entity records
– Annotations
– Other forms of community-contributed content
– Linked data integration of heterogeneous information resources across the Web
for mashups/solutions
• Moving scale-out from development to operations reduces time-to-market, cost of failure for emerging, niche publishing opportunities
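As a rough illustration of the schemaless, document-centric point, the sketch below uses a toy in-memory store with CRUD operations. `DocumentStore` and all field names are hypothetical stand-ins rather than the API of any particular NoSQL product; the point is only that a new content type (here an annotation) is stored without any schema change.

```python
import uuid
from typing import Any, Dict, Optional


class DocumentStore:
    """Toy in-memory stand-in for a schemaless document store (an assumption)."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}

    def create(self, doc: Dict[str, Any]) -> str:
        """Store a copy of the document under a generated id and return the id."""
        doc_id = uuid.uuid4().hex
        self._docs[doc_id] = dict(doc)
        return doc_id

    def read(self, doc_id: str) -> Optional[Dict[str, Any]]:
        """Return the stored document, or None if the id is unknown."""
        return self._docs.get(doc_id)

    def update(self, doc_id: str, changes: Dict[str, Any]) -> None:
        """Merge new or changed fields into an existing document."""
        self._docs[doc_id].update(changes)

    def delete(self, doc_id: str) -> None:
        """Remove the document if it exists."""
        self._docs.pop(doc_id, None)


store = DocumentStore()

# Two content types with different fields live side by side.
article_id = store.create({"type": "article", "title": "Avian flu surveillance"})
annotation_id = store.create(
    {"type": "annotation", "target": article_id, "body": "Compare with 2009 data."}
)

# Enhancement after creation: add a field the article did not originally have.
store.update(article_id, {"subject_tags": ["avian influenza"]})
print(store.read(article_id))
```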
15. Where STM publishing can drive NoSQL requirements
• Integrated support for search
– Free text retrieval
– Faceted navigation
• Query language functionality
– Nearest-neighbor matching
– Joins vs. join-free
• Primitives/support for analytics design patterns
– Clustering
– Classification
– Entity resolution
• Primitives/support for semantic enhancement
– Linked data
– Language processing
• Versioning for document stores
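To make the search requirement concrete, here is a minimal sketch of free text retrieval combined with faceted navigation over a handful of documents. The sample documents, the naive substring match and the `search` and `facet_counts` helpers are assumptions for illustration; a production store would rely on a real inverted index.

```python
from collections import Counter
from typing import Any, Dict, List

# Hypothetical sample documents; titles and fields are invented for illustration.
docs: List[Dict[str, Any]] = [
    {"title": "H5N1 transmission in poultry", "type": "article", "subject": "virology"},
    {"title": "Influenza vaccine handbook", "type": "book", "subject": "immunology"},
    {"title": "H5N1 case dataset", "type": "dataset", "subject": "virology"},
]


def search(documents: List[Dict[str, Any]], text: str) -> List[Dict[str, Any]]:
    """Naive free text retrieval: case-insensitive substring match on the title."""
    needle = text.lower()
    return [d for d in documents if needle in d["title"].lower()]


def facet_counts(hits: List[Dict[str, Any]], field: str) -> Counter:
    """Count how many hits carry each value of the given facet field."""
    return Counter(d[field] for d in hits if field in d)


hits = search(docs, "h5n1")
print(facet_counts(hits, "type"))
print(facet_counts(hits, "subject"))
```

The facet counts are what a user interface would show next to the result list so that a reader can narrow by content type or subject.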
16. Elsevier applications of NoSQL technologies
• Entity registries
• Metadata repositories
• Big data analytics
• User-built apps
20. Conclusions
• STM publishing is in transition
• This is driving new requirements for content
• Many of these requirements are well met by
NoSQL solutions
• Some requirements point to areas of future
work for NoSQL technologists and vendors