What publishers need to
         know about digitization
         Liza Daly
         Consultant, Threepress Consulting Inc.
         http://threepress.org/




Thursday, November 13, 2008
Introduction
         Liza Daly                               liza@threepress.org


              Software engineer and consultant specializing in
              web-based publishing applications
              Digitization projects for Ford Foundation, Arnold
              Arboretum, Rosen Publishing and SAGE Publications
              Online reference products for Oxford University Press
              and Columbia University Press
              Current: ebook applications and consulting



Introduction
         What I’ll cover


                       1. Digitization 101: from scanning to OCR to XML
                       2. Smart vendor selection
                       3. A gentle introduction to XML
                       4. I’ve got digital content: now what?



What we talk about
          when we talk about digitization


              Turning printed content...
              ...or microfilm archives
              ...or documents in legacy systems
              ...into modern digital forms.
              (Sometimes starting from print is easier.)

              [Graphic: "text" becoming "<text>"]



Digitization 101

                    Assume that we’re starting from a print archive.
                    (If you’re starting from a digital file, congratulations,
                    your costs just went down -- but not to zero!)




Scan

                              From paper to digital images...



OCR

                              ...to digital text...



XML

                              ...to reusable markup.

Digitization 101
         Scanning




 http://www.flickr.com/photos/heather-dietz/448629362/

Digitization 101
         Scanning methods

                              Destructive scanning
                              Pages are cut out of the binding and
                              machine-fed into the scanner in batches.
                              (Imagine a huge office copier.)
                              The physical copies that were scanned are
                              normally destroyed.




Digitization 101
         Scanning methods

          Non-destructive scanning
          Pages kept in their original binding
          Manual page-turning
          Originals are returned to the source
          Primarily for rare or historical works




Digitization 101
         Scanning methods

                   High-volume, non-destructive automated scanning also exists.




Digitization 101
         OCR
                  Optical Character Recognition
                  OCR software “guesses” the letters that appear in an
                  image. A dictionary is used to help correct errors.




                  Common errors include wordsruntogether or
                  speling mistakes.
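
As a rough illustration of the dictionary idea, the sketch below uses Python's standard difflib module to swap an unrecognized word for the closest dictionary entry. The tiny word list, the function names, and the similarity cutoff are all assumptions made for the example; real OCR engines rely on far larger dictionaries and more sophisticated models.

```python
import difflib

# A tiny stand-in dictionary (an assumption for this sketch). A real project
# would load a full word list plus any custom vocabulary it needs.
DICTIONARY = ["common", "errors", "include", "spelling", "mistakes", "words"]

def correct_word(word, dictionary=DICTIONARY, cutoff=0.8):
    """Return the closest dictionary word, or the original if none is close."""
    if word.lower() in dictionary:
        return word
    matches = difflib.get_close_matches(word.lower(), dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def correct_text(text):
    """Correct each whitespace-separated word independently."""
    return " ".join(correct_word(w) for w in text.split())

print(correct_text("common speling mistakes"))
```

Words with no close match, such as proper names, are left unchanged, which is one reason a custom dictionary and a human check still matter. The sketch ignores punctuation and does not try to split words that have run together.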



Digitization 101
         OCR

              OCR quality is sensitive to a number of factors.
              Is the document in good condition with clear type?
              Is the layout simple or complex?
              Is a custom dictionary required for proper names or
              obscure terms?




This is easy.



This is hard.



http://timesmachine.nytimes.com/


Digitization 101
         OCR

                                 Better OCR           Worse OCR

              Layout             Simple text          Multicolumn, sidebars
              Vocabulary         Common               Specialized
              Source quality     Clean and legible    Damaged, dirty or partial


Digitization 101
         OCR

              Limitations and cautions:
              Documents with specialized jargon, such as medical
              journals or archaic texts, will require custom
              dictionaries.
              Tables and equations aren’t suitable for OCR.
              A human check is always advisable.



If the goal of digitization is to
         make content findable on
         the web, the text needs to
         be correct.


         1. Scan the documents to convert them to digital files.
         2. Apply OCR to the scans to get computer-ready text.
         3. Convert the text into XML.
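
The final step, turning text into XML, can be sketched in a few lines of Python with the standard xml.etree.ElementTree module. The element names (`document`, `p`) and the rule that a blank line separates paragraphs are assumptions for illustration; a real conversion targets a specific schema and captures far more structure.

```python
import xml.etree.ElementTree as ET

def text_to_xml(ocr_text):
    """Wrap blank-line-separated paragraphs of plain text in <p> elements."""
    root = ET.Element("document")  # hypothetical root element name
    for block in ocr_text.split("\n\n"):
        paragraph = " ".join(block.split())  # collapse line breaks and extra spaces
        if paragraph:
            ET.SubElement(root, "p").text = paragraph
    return ET.tostring(root, encoding="unicode")

sample = "This is a paragraph\nbroken across lines.\n\nThis is another paragraph."
print(text_to_xml(sample))
```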

Digitization 101
         XML


                         Not all digitization projects end with XML.

                         Why?




Characters-per-page versus digitization cost/time

         [Chart: cost/time plotted against characters per page
         (1,000; 1,500; 2,000; 3,000+) for three levels of service:
         Machine OCR, Human-checked OCR, and XML]




Vendor selection
                              and costs



Consider:                  But also:
                  Quantity of material       Project management
                  Quality of the originals   Shipping
                  Layout complexity          Heterogeneous content
                  Vocabulary                 Front/back matter &
                                             indexes




Vendor tips
                   Send samples before considering any estimate
                       ...and have the output evaluated.
                   Compare not just cost-per-page but estimated time.
                   Feel comfortable with their project management.
                   Check references!




Should you partner?




It’s too early to say whether
                              Google Books is right for all
                              publishers.


                              But you’re certainly giving up:
                                1. Control
                                2. Revenue share
                                3. Ownership



Creative partnerships
                                      Consider whether some of your backlist is public domain
                                      or can be released under a Creative Commons license.




XML 101




XML 101
         What’s XML?


                      XML is just plain text, with markers to
                      tell a computer what the text means
                      and how it should be laid out.




XML 101
         What’s XML?

         Text with “markup” is an old idea.



                              This is a paragraph.¶
                              This is another paragraph.




XML 101
         What’s XML?

         XML just changes the symbols around.



                              <p>This is a paragraph.</p>
                              <p>This is another paragraph.</p>
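
Because the markers are regular, software can pick the paragraphs back out. A minimal Python sketch, assuming the two paragraphs are wrapped in a single root element (called `doc` here purely for illustration, since an XML document needs exactly one root):

```python
import xml.etree.ElementTree as ET

# The root element name "doc" is an assumption for this example.
xml_text = "<doc><p>This is a paragraph.</p><p>This is another paragraph.</p></doc>"

root = ET.fromstring(xml_text)
paragraphs = [p.text for p in root.findall("p")]
print(paragraphs)
```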




XML 101
         What’s XML good for?


                          1. Everybody speaks it.

                          2. Once you have one kind of XML,
                             it’s easy to turn it into another kind.
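
To make the second point concrete, here is a small Python sketch that renames the elements of one vocabulary to those of another. The tag names and the one-to-one mapping are assumptions chosen for illustration; real conversions between schemas usually involve more than renaming and are often written in XSLT.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from one vocabulary's tag names to another's.
TAG_MAP = {"doc": "article", "p": "para"}

def convert(xml_text, tag_map=TAG_MAP):
    """Rename every element whose tag appears in tag_map; leave others as-is."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = tag_map.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")

print(convert("<doc><p>This is a paragraph.</p></doc>"))
```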




When you decide to digitize to XML,
             you’ll need to pick what kind of XML you want.




Kinds of XML

         [Diagram of related terms: DTD, Language, Schema, Format, XSD]




XML 101
         Schema vocabulary


              The schema defines the list of <tags> that appear in a
              document, and what they mean.
              A paragraph ¶ in one schema might be <p>, but in
              another it might be <para>.




         [Diagram: XML at the center, surrounded by schema names:
         METS/ALTO, DocBook, PRISM, TEI, DAISY, ePub]




XML 101
         Choosing a schema

              Books                   DocBook, DAISY, ePub, TEI
              Magazines/Newspapers    METS/ALTO, PRISM
              Scholarly               TEI, MathML



XML 101
         DIY schemas

                              Creating your own schema
                                should be a last resort.

                     Expensive to build and maintain.
                     High training and hiring costs.
                     Reduced opportunities for interoperability.
                     Regulatory compliance.

Complex schemas cost more...

         [Chart: cost, from $ to $$$, against schema complexity, from Low to High]

                              ...but also provide more opportunity
                              for product development.

Now what?




Monetizing
         XML conversion



                              XML   web


UGC                 web


Remixing content

         XML allows content to be distributed, altered,
         and recontextualized in unexpected ways.
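
As one hedged illustration of remixing, the Python sketch below pulls chapters by title out of two separate XML sources and combines them into a new collection. The element names (`book`, `chapter`, `anthology`) and the sample content are invented for the example.

```python
import xml.etree.ElementTree as ET

# Two hypothetical source documents; element names are assumptions.
book_a = ET.fromstring(
    "<book><chapter title='Scanning'>Chapter text</chapter>"
    "<chapter title='OCR'>Chapter text</chapter></book>"
)
book_b = ET.fromstring("<book><chapter title='XML'>Chapter text</chapter></book>")

def remix(books, wanted_titles):
    """Build a new <anthology> from the chapters whose titles are wanted."""
    anthology = ET.Element("anthology")
    for book in books:
        for chapter in book.findall("chapter"):
            if chapter.get("title") in wanted_titles:
                anthology.append(chapter)
    return anthology

result = remix([book_a, book_b], {"OCR", "XML"})
print([c.get("title") for c in result])
```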




         http://flickr.com/photos/thomashawk/2492298772/

Small Beer Press




Questions?

                     Liza Daly
                     Threepress Consulting Inc.
                     +01 617 301 0552
                     liza@threepress.org





What publishers need to know about digitization

  • 1. What publishers need to know about digitization Liza Daly Consultant, Threepress Consulting Inc. http://threepress.org/ Thursday, November 13, 2008
  • 2. Introduction Liza Daly liza@threepress.org Software engineer and consultant specializing in web-based publishing applications Digitization projects for Ford Foundation, Arnold Arboretum, Rosen Publishing and SAGE Publications Online reference products for Oxford University Press and Columbia University Press Current: ebook applications and consulting Thursday, November 13, 2008
  • 3. Introduction What I’ll cover 1. Digitization 101: from scanning to OCR to XML 2. Smart vendor selection 3. A gentle introduction to XML 4. I’ve got digital content: now what? ? Thursday, November 13, 2008
  • 4. What we talk about when we talk about digitization Turning printed content... text ...or microfilm archives ...or documents in legacy systems ...into modern digital forms. (sometimes starting from print is easier) <text> Thursday, November 13, 2008
  • 5. Digitization 101 Assume that we’re starting from a print archive. (If you’re starting from a digital file, congratulations, your costs just went down -- but not to zero!)
  • 6. Scan From paper to digital images...
  • 7. OCR ...to digital text...
  • 8. XML ...to reusable markup.
  • 9. Digitization 101 Scanning http://www.flickr.com/photos/heather-dietz/448629362/
  • 10. Digitization 101 Scanning Scan http://www.flickr.com/photos/heather-dietz/448629362/
  • 11. Digitization 101 Scanning methods Destructive scanning Pages are cut out of the binding and machine-fed into the scanner in batch. (Imagine a huge office copier.) Scanned copies are normally destroyed.
  • 12. Digitization 101 Scanning methods Non-destructive scanning Pages kept in their original binding Manual page-turning Originals are returned to the source Primarily for rare or historical works
  • 13. Digitization 101 Scanning methods High-volume, non-destructive automated scanning also exists.
  • 14. Digitization 101 OCR Optical Character Recognition OCR software “guesses” the letters that appear in an image. A dictionary is used to help correct errors. Common errors include wordsruntogether or speling mistakes.
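The dictionary-correction idea on slide 14 can be sketched roughly as follows. This is a minimal illustration, not how any particular OCR product works: the word list, the `correct_word` helper and the 0.8 similarity cutoff are all assumptions made up for the example, and it uses Python's standard `difflib` fuzzy matcher in place of a real OCR engine's language model.

```python
import difflib

# Hypothetical, tiny word list; a real project would load a full dictionary
# plus any custom vocabulary (proper names, specialized terms).
DICTIONARY = ["common", "errors", "include", "spelling",
              "mistakes", "words", "run", "together"]


def correct_word(word, dictionary=DICTIONARY, cutoff=0.8):
    """Return the closest dictionary word, or the word unchanged if none is close."""
    if word.lower() in dictionary:
        return word
    matches = difflib.get_close_matches(word.lower(), dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text):
    """Apply word-by-word correction to whitespace-separated OCR output."""
    return " ".join(correct_word(w) for w in text.split())


print(correct_text("speling mistakes"))
print(correct_word("wordsruntogether"))
```

A single misspelled word is close enough to a dictionary entry to be fixed, but run-together words match nothing and pass through unchanged, which is one reason a human check remains advisable.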
  • 15. Digitization 101 OCR OCR quality is sensitive to a number of factors. Is the document in good condition with clear type? Is the layout simple or complex? Is a custom dictionary required for proper names or obscure terms?
  • 16. This is easy.
  • 17. This is hard.
  • 19. Digitization 101 OCR Better OCR versus worse OCR. Layout: simple text (better), multicolumn or sidebars (worse). Vocabulary: common (better), specialized (worse). Source quality: clean and legible (better); damaged, dirty or partial (worse).
  • 20. Digitization 101 OCR Limitations and cautions: Documents with specialized jargon, such as medical journals or archaic texts, will require custom dictionaries. Tables and equations aren’t suitable for OCR. A human check is always advisable.
  • 21. If the goal of digitization is to make content findable on the web, the text needs to be correct.
  • 22. SCAN the documents to convert to digital files. Apply OCR to the scans to get computer-ready text. Convert the text into XML.
  • 23. Digitization 101 XML Not all digitization projects end with XML. Why?
  • 24. [Chart: characters per page (1,000 / 1,500 / 2,000 / 3,000+) versus digitization cost/time, comparing machine OCR, human-checked OCR and XML]
  • 25. Vendor selection and costs
  • 26. Consider: Quantity of material; Quality of the originals; Layout complexity; Vocabulary. But also: Project management; Shipping; Heterogeneous content; Front/back matter & indexes.
  • 28. Vendor tips Send samples before considering any estimate ...and have the output evaluated. Compare not just cost-per-page but estimated time. Feel comfortable with their project management. Check references!
  • 29. Should you partner?
  • 31. ? ?
  • 32. It’s too early to say whether Google Books is right for all publishers. But you’re certainly giving up: 1. Control 2. Revenue share 3. Ownership
  • 33. Creative partnerships Consider whether some of your backlist is public domain or can be released under a Creative Commons license.
  • 35. XML 101 What’s XML? XML is just plain text, with markers to tell a computer what the text means and how it should be laid out.
  • 36. XML 101 What’s XML? Text with “markup” is an old idea. This is a paragraph.¶ This is another paragraph.
  • 37. XML 101 What’s XML? XML just changes the symbols around. <p>This is a paragraph.</p> <p>This is another paragraph.</p>
  • 38. XML 101 What’s XML good for? 1. Everybody speaks it. 2. Once you have one kind of XML, it’s easy to turn it into another kind.
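To make the second point on slide 38 concrete, here is a small sketch of turning one kind of XML into another by renaming tags. The `<doc>` wrapper, the `<article>` and `<para>` target names and the `TAG_MAP` table are assumptions invented for this example. Production conversions are more commonly written in XSLT and usually involve restructuring as well as renaming.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from the source vocabulary to the target vocabulary.
TAG_MAP = {"doc": "article", "p": "para"}


def convert(xml_text, tag_map=TAG_MAP):
    """Rename every element according to tag_map, leaving text content untouched."""
    root = ET.fromstring(xml_text)
    for element in root.iter():  # iter() visits the root and all descendants
        element.tag = tag_map.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


source = "<doc><p>This is a paragraph.</p><p>This is another paragraph.</p></doc>"
converted = convert(source)
print(converted)
```

The text of the paragraphs is carried over unchanged; only the markup around it differs, which is what makes well-structured XML cheap to repurpose.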
  • 39. When you decide to digitize to XML, you’ll need to pick what kind of XML you want.
  • 40. Kinds of XML
  • 41. Kinds of XML DTD
  • 42. Kinds of XML Language DTD
  • 43. Kinds of XML Language DTD Format
  • 44. Kinds of XML Language DTD Schema Format
  • 45. Kinds of XML Language DTD Schema Format XSD
  • 47. XML 101 Schema vocabulary The schema defines the list of <tags> that appear in a document, and what they mean. A paragraph ¶ in one schema might be <p>, but in another it might be <para>.
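As a rough illustration of the "list of tags" idea on slide 47, the sketch below checks a document against a permitted vocabulary and reports any tags that fall outside it. The `ALLOWED_TAGS` set and the `unknown_tags` helper are hypothetical. A real DTD or XSD validator also checks nesting, ordering and attributes, which this simplified check does not attempt.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary for an imaginary schema that uses <para>, not <p>.
ALLOWED_TAGS = {"book", "title", "para"}


def unknown_tags(xml_text, allowed=ALLOWED_TAGS):
    """Return a sorted list of tag names used in the document but not in the vocabulary."""
    root = ET.fromstring(xml_text)
    used = {element.tag for element in root.iter()}
    return sorted(used - set(allowed))


document = "<book><title>Sample</title><p>This is a paragraph.</p></book>"
print(unknown_tags(document))
```

Here the document uses `<p>` where this vocabulary expects `<para>`, so the check flags it even though the XML itself is well-formed.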
  • 48. METS/ALTO, DocBook, ePub, PRISM, DAISY, TEI
  • 49. XML: METS/ALTO, DocBook, ePub, PRISM, DAISY, TEI
  • 50. XML 101 Choosing a schema Books: DocBook, DAISY, ePub, TEI. Magazines/Newspapers: METS/ALTO, PRISM. Scholarly: TEI, MathML.
  • 51. XML 101 DIY schemas Creating your own schema should be a last resort. Expensive to build and maintain. High training and hiring costs. Reduced opportunities for interoperability. Regulatory compliance.
  • 53. Complex schemas cost more... [Chart: cost ($ to $$$) against schema complexity (low to high)] ...but also provide more opportunity for product development.
  • 55. Monetizing XML conversion XML
  • 56. Monetizing XML conversion XML web
  • 57. XML web
  • 59. UGC web
  • 60. Remixing content XML allows content to be distributed, altered, and recontextualized in unexpected ways. http://flickr.com/photos/thomashawk/2492298772/
  • 61. Small Beer Press
  • 62. Questions? Liza Daly Threepress Consulting Inc. +01 617 301 0552 liza@threepress.org