Converting PDF to EPUB can be challenging without the right tools. After extensive R&D, SPi has developed a new approach for extracting text from searchable PDF inputs.
2. Challenges in Text Extraction from PDF
• PDF is not a markup format, so extracting text from a PDF file is not easy.
• Extraction has to account for fonts, encodings and, sometimes, font subsets.
• Common problems when extracting text from PDF with conventional methods:
  - Special characters are not extracted properly.
  - Formatting is lost, including case changes.
  - Paragraphs are merged or split where they should not be.
  - Content is extracted in the wrong order.
  - Text in columns is mixed up.
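The column problem can be illustrated with a minimal Python sketch. It assumes a hypothetical `Block` record holding the left edge and top position of each piece of extracted text, and a hypothetical `column_gap` threshold; it is not SPiZONE's method, only one simple way to avoid mixing columns.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block with its position on the page."""
    text: str
    x0: float   # left edge of the block
    top: float  # distance from the top of the page


def reading_order(blocks, column_gap=50.0):
    """Group blocks into columns by left edge, then read each column top to bottom."""
    columns = []  # each inner list holds the blocks of one column
    for block in sorted(blocks, key=lambda b: b.x0):
        # A block starts a new column when its left edge is far from
        # the left edge of the current column's first (leftmost) block.
        if columns and block.x0 - columns[-1][0].x0 < column_gap:
            columns[-1].append(block)
        else:
            columns.append([block])
    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.top))
    return ordered


page = [
    Block("B1", 300, 10),
    Block("A2", 50, 100),
    Block("A1", 52, 10),
    Block("B2", 301, 100),
]
print([b.text for b in reading_order(page)])
```

Sorting the same blocks purely by vertical position would interleave the two columns, which is the mix-up described above.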
We inSPire success.
3. Introduction
• After extensive R&D, SPi has developed a new approach for extracting text from searchable PDF inputs.
• The SPiZONE tool was developed to provide a generic workflow covering OCR on raster PDFs and scanned images and text extraction from searchable PDFs.
• The output of SPiZONE Verify is a short-tagged text file. It can be converted further into any output format, such as XML or ePub.
4. Product Highlights
• Text extraction is possible for all languages.
• Text accuracy is more than 99.95%.
• Table extraction, including column spanning and row spanning, based on user input.
• Image extraction.
• Text within a zone can be marked as ‘Ignore Text’ so that it is not produced in the output.
5. PDF to Text using SPiZONE - Quick Workflow
• SZI Generator: SZI generation (server process)
• SPiZONE Edit: styling and zoning
• Extraction: PDF to HTML (server process)
• SPiZONE Verify: content QA
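As a rough illustration, the four stages can be written down as a table of inputs and outputs and checked for consistency. The sketch below is Python; the stage names and file types are the ones this deck gives for each stage, while `STAGES`, `check_pipeline` and the assumption that SPiZONE Edit reads the LowRes TIFF are hypothetical.

```python
# Each entry: (stage name, files it needs, files it produces).
STAGES = [
    ("SZI Generator", {"PDF"}, {"LowRes TIFF", "SZI"}),
    ("SPiZONE Edit", {"LowRes TIFF", "SZI"}, {"SZI"}),
    ("Extraction", {"PDF", "SZI"}, {"HTML", "SZD"}),
    ("SPiZONE Verify", {"HTML", "SZI", "LowRes TIFF"}, {"short-tagged text"}),
]


def check_pipeline(stages, initial):
    """Walk the stages in order and confirm every input is already available."""
    available = set(initial)
    for name, inputs, outputs in stages:
        missing = inputs - available
        if missing:
            raise ValueError(f"{name} is missing inputs: {sorted(missing)}")
        available |= outputs
    return available


print(sorted(check_pipeline(STAGES, {"PDF"})))
```

Starting from the PDF alone, every stage finds its inputs among the outputs of earlier stages, which is what makes the workflow reusable across projects.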
6. SZI Generation
• Server-based process.
• Input: PDF
• Output: LowRes TIFF and SZI
• SZI – Styling and Zoning Information
7. SPiZONE Edit
• Styling and zoning application.
• Input: TIFF and SZI
• Output: SZI
• The user identifies the text to be extracted by drawing zones. While drawing zones, the user assigns a style name, a sequence number and other properties to each element.
• These style names are used during post-extraction processing and during XML/ePub conversion.
• The zone information is saved in the SZI file.
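To make the idea of a zone concrete, here is a minimal Python sketch of a zone record and of ordering zones for extraction. The SZI file format is not described in this deck, so the `Zone` class, its field names and the sample style names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """Hypothetical zone record; not the actual SZI layout."""
    page: int
    sequence: int   # reading order assigned by the user
    style: str      # style name used later during conversion
    bbox: tuple     # (left, top, right, bottom) on the page image


def extraction_order(zones):
    """Return zones page by page, in the user-assigned sequence."""
    return sorted(zones, key=lambda z: (z.page, z.sequence))


zones = [
    Zone(page=1, sequence=2, style="BodyText", bbox=(50, 120, 550, 700)),
    Zone(page=2, sequence=1, style="BodyText", bbox=(50, 60, 550, 700)),
    Zone(page=1, sequence=1, style="ChapterTitle", bbox=(50, 40, 550, 100)),
]
for zone in extraction_order(zones):
    print(zone.page, zone.sequence, zone.style)
```

Because the order comes from the sequence numbers rather than from the position of text in the PDF stream, content in columns or sidebars can be extracted in the order the user intended.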
9. Text Extraction from PDF
• Server-based process.
• Input: PDF and SZI
• Output: HTML, SZD
• SZD – SPiZONE Document, used for logging.
• Font details, uncertain spaces, soft hyphens, etc. are flagged in the extracted file; SPiZONE Verify uses these flags.
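The kind of flagging described here can be sketched in a few lines of Python. This is only an illustration, assuming the extracted text is available as a list of lines; `flag_hyphens` and the flag names are hypothetical, and the sketch covers hyphens only, not fonts or uncertain spaces.

```python
SOFT_HYPHEN = "\u00ad"  # Unicode SOFT HYPHEN


def flag_hyphens(lines):
    """Return (line number, column, kind) for each hyphen a reviewer should check."""
    flags = []
    for line_no, line in enumerate(lines, start=1):
        # Soft hyphens are invisible in most viewers, so flag each one.
        for col, ch in enumerate(line):
            if ch == SOFT_HYPHEN:
                flags.append((line_no, col, "soft-hyphen"))
        # A hyphen at the end of a line may be a word break or a real hyphen.
        stripped = line.rstrip()
        if stripped.endswith("-"):
            flags.append((line_no, len(stripped) - 1, "line-end-hyphen"))
    return flags


lines = ["con\u00adversion of multi-", "column text"]
for flag in flag_hyphens(lines):
    print(flag)
```

The flags are recorded rather than resolved automatically, so that a person can decide each case during content QA.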
10. SPiZONE Verify
• OCR/text extraction QA application.
• Input: extracted content in HTML format, SZI and LowRes TIFF.
• Output: short-tagged files.
• With this application, the user performs a regulated content check on the extracted HTML files.
• Font normalization is used to make sure that all characters are extracted correctly. The user can correct any discrepancies.
• Verify does not allow the user to create a short-tagged file until all fonts are normalized and all uncertain spaces and soft hyphens are checked.
• To see how SPiZONE Verify works, open the video on the next slide.
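The rule that output is blocked until every font and flag has been reviewed can be sketched as a simple gate. The Python below is an assumption about how such a check might look; `ReviewState`, `blocking_items` and `can_export` are hypothetical names, not part of SPiZONE Verify.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewState:
    """Hypothetical QA state: True means the item has been reviewed."""
    fonts: dict = field(default_factory=dict)  # font name -> normalized?
    flags: dict = field(default_factory=dict)  # flag id -> checked?


def blocking_items(state):
    """List everything that still prevents export."""
    pending = [f"font:{name}" for name, done in state.fonts.items() if not done]
    pending += [f"flag:{fid}" for fid, done in state.flags.items() if not done]
    return pending


def can_export(state):
    """Export is allowed only when nothing is pending."""
    return not blocking_items(state)


state = ReviewState(
    fonts={"Times-Roman": True, "Symbol": False},
    flags={"p3-soft-hyphen-1": True},
)
print(can_export(state), blocking_items(state))
state.fonts["Symbol"] = True
print(can_export(state), blocking_items(state))
```

Returning the list of pending items, and not just a yes or no, lets the application point the user to what is left to review.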
12. Processing SPiZONE Output
• The workflow for creating a short-tagged text file from PDF is generic for all projects.
• Short-tagged text files can be converted further into XML, ePub or any other format, as the project requires.
• SPiZONE Structure is a customizable application used for conversion into any format, including (but not limited to) XML and ePub.
• Structure applications can be built in a shorter period of time for any XML conversion project.
• The SPiZONE ePub application accepts short-tagged files as input to create ePub2/3.
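A style-driven conversion of this kind might look like the following Python sketch. The short-tagged format is not specified in this deck, so the input is assumed to be a list of (style name, text) pairs; `STYLE_TO_TAG`, `to_xml` and the element names are hypothetical and do not represent SPiZONE Structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from zoning style names to XML element names.
STYLE_TO_TAG = {"ChapterTitle": "title", "BodyText": "para"}


def to_xml(paragraphs, root_tag="chapter"):
    """Build an XML string from (style, text) pairs using the style mapping."""
    root = ET.Element(root_tag)
    for style, text in paragraphs:
        # Unknown styles fall back to a plain paragraph element.
        child = ET.SubElement(root, STYLE_TO_TAG.get(style, "para"))
        child.set("style", style)
        child.text = text  # ElementTree escapes &, < and > on output
    return ET.tostring(root, encoding="unicode")


print(to_xml([("ChapterTitle", "Introduction"), ("BodyText", "Fonts & encodings")]))
```

Keeping the mapping in a table separate from the code is one way a conversion could be adapted to a new project by changing only the table.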
24. Know more about PDF to ePUB conversion
http://www.spi-global.com/content-solutions/our-services/publishingsolutions/conversion/convert-pdf-epub