SPiZONE
Presentation
We inSPire success.
Challenges in Text Extraction from PDF
• PDF is not a markup format, so extracting text from a PDF file is not easy.
• When extracting the text, we need to take care of fonts, encodings and sometimes font subsets.
• Usual problems encountered when extracting text from PDF with conventional methods are:
  - Special characters are not extracted properly.
  - Formatting is lost, including case changes.
  - Paragraphs are merged or split unexpectedly.
  - Content is extracted in the wrong order.
  - Text in columns is mixed up.
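
The last two problems can be illustrated with a minimal sketch. It assumes each text fragment comes with page coordinates; the `Fragment` type, the gutter position and the sample values are hypothetical and are not part of SPiZONE.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float  # left edge of the fragment
    y: float  # distance from the top of the page


def naive_order(fragments):
    """Sort top-to-bottom, then left-to-right: this interleaves columns."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def column_order(fragments, gutter_x):
    """Read the whole left column first, then the right column."""
    left = [f for f in fragments if f.x < gutter_x]
    right = [f for f in fragments if f.x >= gutter_x]
    ordered = sorted(left, key=lambda f: f.y) + sorted(right, key=lambda f: f.y)
    return [f.text for f in ordered]


# Two columns (A and B), two lines each; the gutter is assumed at x = 300.
page = [
    Fragment("A1", 50, 100), Fragment("B1", 320, 100),
    Fragment("A2", 50, 120), Fragment("B2", 320, 120),
]
print(naive_order(page))
print(column_order(page, gutter_x=300))
```

The naive sort mixes the two columns line by line, while the column-aware version keeps each column together. Real pages need the column boundaries to be detected or supplied by a user.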

Introduction
• After extensive R&D, SPi has come up with a new approach for extracting text from searchable PDF inputs.
• The SPiZONE tool was developed to provide a generic workflow covering OCR on raster PDFs and scanned images as well as text extraction from searchable PDFs.
• The output of SPiZONE Verify is a short-tagged text file, which can be further converted into any output format such as XML or ePub.

Product Highlights
• Text extraction is possible for all languages.
• Text accuracy is more than 99.95%.
• Table extraction, including column spanning and row spanning, based on user input.
• Image extraction.
• Option to mark text within a zone as 'Ignore Text' so that it is not produced in the output.

PDF to Text using SPiZONE - Quick Workflow

• SZI Generator: SZI generation (server process)
• SPiZONE Edit: styling and zoning
• Extraction: PDF to HTML (server process)
• SPiZONE Verify: content QA

SZI Generation
• Server-based process
• Input: PDF
• Output: LowRes (low-resolution) TIFF and SZI
• SZI: Styling and Zoning Information

SPiZONE Edit
• Styling and zoning application
• Input: TIFF and SZI
• Output: SZI
• The user identifies the text to be extracted by drawing zones. While zones are drawn, style names, sequence numbers and other properties are assigned to each element.
• These style names are used during post-extraction processing and during XML/ePub conversion.
• The zone information is saved in the SZI file.
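
The kind of information a zone carries can be sketched as follows. The SZI file format is not described in these slides, so the `Zone` fields, the style names and the `extraction_plan` helper are hypothetical. As a simplification, 'Ignore Text' is modeled here as a flag on a whole zone.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    sequence: int         # reading-order position assigned by the user
    style: str            # style name, reused later during conversion
    bbox: tuple           # (left, top, right, bottom) on the page image
    ignore: bool = False  # True if the text must not appear in the output


def extraction_plan(zones):
    """Return the zones to extract, in the user-defined sequence order."""
    kept = [z for z in zones if not z.ignore]
    return sorted(kept, key=lambda z: z.sequence)


# Zones are listed in the order they were drawn, not in reading order.
zones = [
    Zone(2, "para", (50, 140, 550, 700)),
    Zone(3, "running-header", (50, 20, 550, 40), ignore=True),
    Zone(1, "heading", (50, 60, 550, 120)),
]
plan = [(z.sequence, z.style) for z in extraction_plan(zones)]
print(plan)
```

Because the sequence number is stored with each zone, the extraction order no longer depends on how the text happens to be laid out inside the PDF.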

SPiZONE Edit -- DEMO

Text Extraction from PDF
• Server-based process.
• Input: PDF and SZI
• Output: HTML, SZD
• SZD: SPiZONE Document, used for logging.
• Font details, uncertain spaces, soft hyphens and similar items are flagged in the extracted file; SPiZONE Verify uses these flags.
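
Why such items are flagged rather than resolved automatically can be shown with a small sketch. It does not reproduce SPiZONE's extraction logic: the function names, the gap thresholds and the sample text are assumptions made for illustration.

```python
def flag_line_end_hyphens(lines):
    """Join lines and flag every line-end hyphen as a soft-hyphen candidate.

    A hyphen at the end of a line may be a soft hyphen (to be dropped) or a
    real one (to be kept), so it is reported for review, not removed.
    """
    parts, flags = [], []
    for i, line in enumerate(lines):
        line = line.rstrip()
        if line.endswith("-") and i < len(lines) - 1:
            flags.append((i, line.split()[-1]))
            parts.append(line)        # join directly with the next line
        else:
            parts.append(line + " ")
    return "".join(parts).strip(), flags


def classify_gap(gap, font_size, low=0.15, high=0.30):
    """Classify a horizontal gap between glyphs (thresholds are assumed)."""
    ratio = gap / font_size
    if ratio < low:
        return "no-space"
    if ratio > high:
        return "space"
    return "uncertain"  # left for a human to decide


text, flags = flag_line_end_hyphens(
    ["The extrac-", "tion keeps well-", "known words."]
)
print(text)
print(flags)
print(classify_gap(2.0, font_size=10))
```

In the sample, "extrac-" is a soft hyphen and "well-" is a real one, yet both look identical in the extracted text, which is why a review step is needed.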

SPiZONE Verify
• OCR/text extraction QA application.
• Input: extracted content in HTML format, SZI and LowRes TIFF.
• Output: short-tagged files.
• With this application, the user performs a regulated content check on the extracted HTML files.
• Font normalization is used to make sure all characters are extracted correctly. The user can correct any discrepancies.
• Verify will not allow the user to create a short-tagged file without normalizing all fonts and checking all uncertain spaces and soft hyphens.
• To see how SPiZONE Verify works, open the video on the next slide.
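
The rule that a short-tagged file cannot be created until every font is normalized and every flag is checked can be sketched as a simple gate. The `ReviewState` class and `export_short_tagged` function are hypothetical names and do not describe the actual application.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewState:
    fonts: dict = field(default_factory=dict)  # font name -> normalized?
    flags: dict = field(default_factory=dict)  # flag id -> checked?

    def pending(self):
        """List every font or flag that still needs attention."""
        items = [f"font:{name}" for name, ok in self.fonts.items() if not ok]
        items += [f"flag:{fid}" for fid, ok in self.flags.items() if not ok]
        return items


def export_short_tagged(state, content):
    """Return the content only when nothing is left to review."""
    pending = state.pending()
    if pending:
        raise ValueError("unresolved items: " + ", ".join(pending))
    return content


state = ReviewState(fonts={"Symbol": False}, flags={"hyphen-1": True})
print(state.pending())        # one font is still not normalized
state.fonts["Symbol"] = True  # the user normalizes the font
print(export_short_tagged(state, "content ready for export"))
```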

SPiZONE Verify -- DEMO

Processing SPiZONE Output
• The workflow for creating a short-tagged text file from PDF is generic for all projects.
• Short-tagged text files can be further converted into XML, ePub or any other format as per project requirements.
• SPiZONE Structure is a customizable application used for conversion into any format, such as (but not limited to) XML and ePub.
• Structure applications can be built in a shorter period of time for any XML conversion project.
• The SPiZONE ePub application accepts short-tagged files as input to create ePub2/3.
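
How style names might drive a later conversion can be sketched as follows. The short-tagged syntax is not shown in these slides, so the `@style:text` line format, the style-to-element mapping and the element names below are purely hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from style names to XML element names.
STYLE_TO_TAG = {"h1": "title", "p": "para"}


def short_tagged_to_xml(text):
    """Convert lines of the assumed form '@style:text' into a small XML tree."""
    root = ET.Element("document")
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        style, _, body = line.partition(":")
        tag = STYLE_TO_TAG.get(style.lstrip("@"), "para")
        ET.SubElement(root, tag).text = body.strip()
    return ET.tostring(root, encoding="unicode")


sample = "@h1:Challenges\n@p:PDF is not a markup format."
print(short_tagged_to_xml(sample))
```

A project-specific converter would swap in its own mapping, which is one way a generic intermediate file can feed several output formats.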

SPiZONE Edit Samples

SPiZONE Verify Samples

ePUB Output Samples

Learn more about PDF to ePUB conversion:
http://www.spi-global.com/content-solutions/our-services/publishingsolutions/conversion/convert-pdf-epub

Convert PDF to EPUB with SPiZone
