Convert PDF to EPUB with SPiZone

SPiZONE
Presentation
We inSPire success.

Challenges in Text Extraction from PDF
• PDF is not a markup format, so extracting text from a PDF file is not easy.
• When extracting the text, we need to take care of fonts, encoding and
  sometimes font subsets.
• Usual problems encountered when extracting text from PDF using
  conventional methods are:
  - Special characters are not properly extracted.
  - Formatting is lost, including case changes.
  - Paragraphs are merged or split where they should not be.
  - Content is extracted in the wrong order.
  - Text in columns is mixed up.
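The last two problems can be illustrated with a minimal Python sketch. It is not SPiZONE code: the `Fragment` class, the coordinates, and the `column_boundary` parameter are assumptions made up for this example. It shows how reading a two-column page strictly top-to-bottom interleaves the columns, and how knowing the column layout restores the reading order.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text and the position of its top-left corner (hypothetical model)."""
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page


def naive_order(fragments):
    """Read strictly top-to-bottom, then left-to-right, ignoring columns."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def column_aware_order(fragments, column_boundary):
    """Read the whole left column first, then the whole right column."""
    def key(f):
        column = 0 if f.x < column_boundary else 1
        return (column, f.y, f.x)
    return [f.text for f in sorted(fragments, key=key)]


page = [
    Fragment("Left line 1", x=50, y=100),
    Fragment("Right line 1", x=320, y=100),
    Fragment("Left line 2", x=50, y=120),
    Fragment("Right line 2", x=320, y=120),
]

print(naive_order(page))                             # columns interleaved
print(column_aware_order(page, column_boundary=300)) # left column, then right
```

The naive ordering alternates between the two columns, which is the "mixed up" result described above. The column-aware ordering needs layout knowledge that the PDF itself does not carry.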

Introduction
• After extensive R&D, SPi has come up with a new approach for
  extracting text from searchable PDF inputs.
• The SPiZONE tool was developed to provide a generic workflow covering OCR
  on raster PDFs and scanned images as well as text extraction from
  searchable PDFs.
• The output of SPiZONE Verify is a short-tagged text file. It can be further
  converted into any output format, such as XML or ePub.

Product Highlights
• Text extraction is possible for all languages.
• Text accuracy is more than 99.95%.
• Table extraction, including column spanning and row spanning, based
  on user input.
• Image extraction.
• Option to mark text within zones as ‘Ignore Text’ so that it is not
  produced in the output.

PDF to Text using SPiZONE - Quick Workflow

• SZI Generator: server process
• SPiZONE Edit: styling and zoning
• Extraction: PDF to HTML (server process)
• SPiZONE Verify: content QA

SZI Generation
• Server-based process
• Input: PDF
• Output: LowRes TIFF and SZI
• SZI stands for Styling and Zoning Information

SPiZONE Edit
• Styling and Zoning application
• Input: TIFF and SZI
• Output: SZI
• The user identifies the text to be extracted by drawing zones. While the
  zones are drawn, style names, sequence numbers and other properties are
  assigned to each element.
• These style names are used during post-extraction processing and during
  XML/ePub conversion.
• The zone information is saved in the SZI file.
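To make the role of zones concrete, here is a minimal Python sketch of how zone properties could drive extraction. The `Zone` class and its fields are hypothetical and chosen only for illustration; the actual SZI file format is not described in this presentation. As a simplification, the ‘Ignore Text’ marking is modeled per zone rather than for text within a zone.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """One user-drawn zone (hypothetical fields, not the real SZI schema)."""
    sequence: int         # reading-order position assigned while zoning
    style: str            # style name, e.g. "heading" or "body"
    text: str             # text found inside the zone
    ignore: bool = False  # True when the content is marked as 'Ignore Text'


def extract_in_order(zones):
    """Return (style, text) pairs in sequence order, skipping ignored zones."""
    kept = [z for z in zones if not z.ignore]
    return [(z.style, z.text) for z in sorted(kept, key=lambda z: z.sequence)]


zones = [
    Zone(sequence=2, style="body", text="First paragraph."),
    Zone(sequence=3, style="footer", text="Page 12", ignore=True),
    Zone(sequence=1, style="heading", text="Chapter 1"),
]

for style, text in extract_in_order(zones):
    print(f"{style}: {text}")
```

Because each extracted element keeps its style name, a later conversion step can map styles to output markup without looking at the page layout again.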

SPiZONE Edit -- DEMO

Text Extraction from PDF
• Server-based process.
• Input: PDF and SZI
• Output: HTML, SZD
• SZD stands for SPiZONE Document; it is used for logging.
• Font details, uncertain spaces, soft hyphens etc. are flagged in the
  extracted file. SPiZONE Verify uses these flags.
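The idea of flagging doubtful spots instead of guessing can be sketched in Python as follows. This is an illustration under stated assumptions, not the SPiZONE algorithm: the markers `[SHY]` and `[SP?]`, the gap-to-space-width ratio, and the `tolerance` parameter are all invented for this example.

```python
SOFT_HYPHEN = "\u00ad"  # Unicode SOFT HYPHEN


def flag_soft_hyphens(text, marker="[SHY]"):
    """Replace each soft hyphen with a visible marker for later review."""
    return text.replace(SOFT_HYPHEN, marker)


def join_runs(runs, gaps, space_width, tolerance=0.25, marker="[SP?]"):
    """Join text runs, deciding from the horizontal gaps whether a space belongs.

    gaps[i] is the gap between runs[i] and runs[i + 1]. A gap clearly below
    half a space width means no space, a gap clearly above it means a space,
    and anything in between is flagged as uncertain (assumed heuristic).
    """
    if not runs:
        return ""
    if len(gaps) != len(runs) - 1:
        raise ValueError("need exactly one gap between each pair of runs")
    parts = [runs[0]]
    for run, gap in zip(runs[1:], gaps):
        ratio = gap / space_width
        if ratio < 0.5 - tolerance:
            parts.append(run)           # glyphs touch: same word
        elif ratio > 0.5 + tolerance:
            parts.append(" " + run)     # clear word gap
        else:
            parts.append(marker + run)  # leave the decision to a reviewer
    return "".join(parts)


print(flag_soft_hyphens("extrac\u00adtion"))
print(join_runs(["Text", "extrac", "tion", "works"], [3.0, 0.2, 1.5], space_width=3.0))
```

Carrying explicit markers through to the QA step means a reviewer confirms or rejects each doubtful space or hyphen rather than having to find them by proofreading.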

SPiZONE Verify
• OCR/Text Extraction QA application.
• Input: Extracted content in HTML format, SZI and LowRes TIFF.
• Output: Short-tagged files.
• With this application, the user performs regulated content checking on
  the extracted HTML files.
• Font normalization is used to make sure all characters are extracted
  correctly. The user can correct any discrepancies.
• Verify will not allow the user to create a short-tagged file without
  normalizing all fonts and checking all uncertain spaces and soft hyphens.
• To see how SPiZONE Verify works, open the video on the next slide.
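The rule that output is blocked until every flag has been reviewed can be sketched in Python as below. The `Flag` class, the `UnresolvedFlagsError` exception and the `export_short_tagged` function are hypothetical names for this example only; they do not describe the real application.

```python
from dataclasses import dataclass


@dataclass
class Flag:
    """One item that needs review (hypothetical model)."""
    kind: str       # e.g. "font", "uncertain_space", "soft_hyphen"
    location: str   # where the item occurs, e.g. "page 3, zone 2"
    resolved: bool = False


class UnresolvedFlagsError(Exception):
    """Raised when export is attempted before all flags are reviewed."""


def export_short_tagged(content, flags):
    """Return the content for export only if every flag has been resolved."""
    pending = [f for f in flags if not f.resolved]
    if pending:
        raise UnresolvedFlagsError(f"{len(pending)} flag(s) still need review")
    return content


flags = [Flag("font", "page 1, zone 1"), Flag("soft_hyphen", "page 1, zone 2")]

try:
    export_short_tagged("<h1>Chapter 1</h1>", flags)
except UnresolvedFlagsError as err:
    print("Export blocked:", err)

for flag in flags:
    flag.resolved = True  # the reviewer has checked each item

print(export_short_tagged("<h1>Chapter 1</h1>", flags))
```

Making the check a hard gate, rather than a warning, is what lets the quality step be enforced uniformly across projects.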

SPiZONE Verify -- DEMO

Processing SPiZONE Output
• The workflow for creating short-tagged text files from PDF is generic
  for all projects.
• Short-tagged text files can be further converted into XML, ePub or any
  other format, as per project requirements.
• SPiZONE Structure is a customizable application used for conversion
  into any format, such as (but not limited to) XML and ePub.
• Structure applications can be built in a short period of time for any XML
  conversion project.
• The SPiZONE ePub application accepts short-tagged files as input to create
  ePub2/3.

SPiZONE Edit Samples

SPiZONE Verify Samples

ePUB Output Samples

Know more about PDF to ePUB conversion
http://www.spi-global.com/content-solutions/our-services/publishingsolutions/conversion/convert-pdf-epub
