Intelligently Extracting Data from PDFs
Presented by Matt Kuznicki
Chief Technical Officer, Datalogics
Agenda
• Technical Challenges in PDF Data Extraction
• Key Considerations for Data Extraction
• Use Cases
• About Datalogics PDF Alchemist
About Me
• Chief Technical Officer at Datalogics
• Vice Chairman of PDF Association Board of Directors
• Worked extensively with PDF for over 15 years
• Active participant in the PDF standards community
Technical Challenges in PDF Data Extraction
Extraction: Technical Challenges
• PDF is a page description language – elements
typically have fixed position on a physical plane
• Elements are not necessarily defined in order of
appearance
• Richer vocabulary for expressing elements than other
formats
• Structure and semantics of elements not commonly
stated
PDF as Page Description Language
At the time PDF was conceived in the 1990s, reliable rendering
for human readers was an important issue:
• Focus was on retrieving the information needed to display
and print pages for people’s use
• Affordances to give content semantics came much later
• Community has made great strides in allowing for machine
interpretation, but proper use requires expertise in the domain
• Structure and semantics are optional – usage is still rare
• This is NOT a PDF-specific issue
PDF as Page Description Language
• PDF format most concerned with expressing exact visual
representation
• Elements are placed at fixed positions on virtual pages, in small
discrete pieces
• Not as fine-grained as individual dots in a raster file, but not as
continuous as the content of most HTML
• No guarantee of sentences or even letters grouped together to
form whole words in a PDF data stream
• Usually PDF files contain no information about how elements relate to
each other
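Because glyphs may be placed one at a time, an extractor usually has to regroup them into words from their positions. A minimal sketch, assuming the glyphs of one text line arrive as (character, left edge, right edge) in arbitrary order and that a gap wider than a chosen threshold marks a word boundary. The `Glyph` type and the `gap` value are illustrative and not part of any PDF library:

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    """Hypothetical placed glyph: one character and its horizontal extent."""
    char: str
    x0: float  # left edge, in page units
    x1: float  # right edge, in page units


def glyphs_to_words(glyphs: list[Glyph], gap: float = 1.5) -> list[str]:
    """Group the glyphs of a single text line into words.

    Glyphs are sorted left to right; a horizontal gap larger than `gap`
    starts a new word. The threshold is an assumption and would normally
    be scaled to the font size.
    """
    words: list[str] = []
    current = ""
    previous_right = None
    for glyph in sorted(glyphs, key=lambda g: g.x0):
        if previous_right is not None and glyph.x0 - previous_right > gap:
            words.append(current)
            current = ""
        current += glyph.char
        previous_right = glyph.x1
    if current:
        words.append(current)
    return words
```

A fixed threshold is the simplest possible choice; real pages with justified text or tight kerning would likely need a threshold derived from the font metrics.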
PDF as Page Description Language
PDF pages often contain content that is a byproduct of
breaking data into page-size chunks, such as:
• Page numbers
• Page headers and footers
• Guides and information for printing
These elements are not usually considered real
document data, so extracting them as content is usually
undesired.
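One common way to spot such artifacts is to look for top or bottom lines that recur on most pages. A minimal sketch, assuming each page has already been reduced to a list of text lines in top-to-bottom order; the function name and the `min_share` threshold are illustrative assumptions:

```python
import re
from collections import Counter


def strip_pagination_artifacts(pages: list[list[str]], min_share: float = 0.6) -> list[list[str]]:
    """Drop first/last lines that recur on at least `min_share` of pages.

    Digits are normalized so that "Page 3" and "Page 4" count as the
    same line. With fewer than two pages nothing can recur, so the
    input is returned unchanged.
    """
    if len(pages) < 2:
        return [list(lines) for lines in pages]

    def normalize(line: str) -> str:
        return re.sub(r"\d+", "#", line.strip())

    counts = Counter()
    for lines in pages:
        if lines:
            # Only the top and bottom line of each page are candidates.
            counts.update({normalize(lines[0]), normalize(lines[-1])})

    threshold = min_share * len(pages)
    artifacts = {text for text, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        kept = list(lines)
        if kept and normalize(kept[0]) in artifacts:
            kept = kept[1:]
        if kept and normalize(kept[-1]) in artifacts:
            kept = kept[:-1]
        cleaned.append(kept)
    return cleaned
```

This only inspects one line at the top and one at the bottom of each page; multi-line headers or printing guides would need a wider window or positional information.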
Elements and Ordering
Small graphic elements can mean big extraction problems:
• Contents of a PDF page can be specified in an order very different
from how we read
• Humans automatically see a page flow that is not always present in
the PDF data stream
• Words, images and other elements on a page may have the marks
that constitute them spread far throughout the page marking stream
• Without ordering information, flow of PDF content must be
heuristically derived and is subject to differing interpretations
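A simple ordering heuristic can be sketched as follows. It assumes a single-column page and text fragments given as (text, x, y) in default PDF coordinates, where y grows upward from the bottom of the page. The `Fragment` type and the `line_tolerance` value are illustrative, and this is only one of several possible interpretations of page flow:

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """Hypothetical text fragment with the position of its baseline start."""
    text: str
    x: float
    y: float  # PDF user space: y grows upward from the bottom of the page


def reading_order(fragments: list[Fragment], line_tolerance: float = 2.0) -> list[str]:
    """Rebuild text lines for a single-column page from unordered fragments."""
    lines: list[list[Fragment]] = []
    # Visit fragments top to bottom (largest y first).
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)  # same baseline, within tolerance
        else:
            lines.append([frag])
    # Within each line, read left to right.
    return [" ".join(f.text for f in sorted(line, key=lambda f: f.x)) for line in lines]
```

On a multi-column page this would interleave the columns line by line, which is exactly the kind of differing interpretation the last bullet warns about.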
Richer Vocabulary For Elements
PDF includes a richer way to express elements than most other languages:
• Images can be in many different forms, including formats derived from GIF,
JPEG, PNG, JPEG 2000 and JBIG2
• Fonts can be in several forms, including OpenType, TrueType, Type 1, CFF,
multiple master; or expressed in PDF element syntax
• Text may be expressed in a way that includes Unicode information – or in one of
hundreds of encodings – but no Unicode information is actually required
• Rich transparency and blending model allows for complex element interaction
• Content may be optionally present or absent from a page depending on a
number of different triggers and conditions
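The point about Unicode can be illustrated with a small sketch. It assumes single-byte character codes and a hypothetical `to_unicode` mapping of the kind an optional Unicode map in the file might supply; codes with no entry cannot be turned into text reliably and are replaced with U+FFFD here:

```python
def decode_codes(codes: bytes, to_unicode: dict[int, str]) -> str:
    """Map font-specific character codes to Unicode text.

    One code may map to several characters (for example a ligature).
    Codes missing from the mapping become U+FFFD, the replacement
    character, because their meaning is unknown without more information.
    """
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)
```

For example, `decode_codes(b"\x01\x02", {1: "f", 2: "fi"})` yields the three-character string `"ffi"`, while an empty mapping yields only replacement characters.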
Structure and Semantics
Information on the structure and semantics of a PDF page is
usually not present:
• Lists are really just bunches of words and sometimes symbols
humans interpret as bullets or delimiters
• Tables are really just a series of lines and shaded boxes, and
bunches of words, that humans interpret together as rows,
columns and headers
• Paragraphs are really just bunches of words positioned on a
page in such a way that humans interpret them as sentences
grouped together
• Columns don’t exist in the PDF data stream; we humans simply
see elements grouped in a way that suggests columns
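Since columns are only implied by position, a converter has to infer them. A minimal sketch, assuming the horizontal extents (left, right) of the text fragments on a page are known and that any empty vertical band at least `min_gutter` units wide separates two columns; the function name and default value are illustrative:

```python
def find_column_ranges(
    extents: list[tuple[float, float]], min_gutter: float = 12.0
) -> list[tuple[float, float]]:
    """Merge horizontal extents (x0, x1) of text fragments into column ranges.

    Extents that overlap, or are separated by less than `min_gutter`,
    are treated as one column. Any wider empty band is taken as a gutter.
    """
    columns: list[tuple[float, float]] = []
    for x0, x1 in sorted(extents):
        if columns and x0 - columns[-1][1] < min_gutter:
            left, right = columns[-1]
            columns[-1] = (left, max(right, x1))
        else:
            columns.append((x0, x1))
    return columns
```

A full-width heading that spans both columns would bridge the gutter and collapse the result to a single range, so a practical version would likely analyze horizontal bands of the page separately.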
Structure and Semantics
When creating PDFs, it is possible to include structure and
semantics in the PDF:
• Creating tagged PDF means the information for conversion is
included directly in the PDF when it’s created – at the right
time!
• Easy to convert tagged PDF into other formats and to reflow
• Not all tagged PDF is of good quality – and not all generators
emit useful tagged PDF!
Bottom line: you can’t count on getting PDF that has easily
extractable content!
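Whether a file declares itself as tagged can be checked from its document catalog: the `/MarkInfo` dictionary carries a `/Marked` flag, and the structure lives under `/StructTreeRoot`. A minimal sketch, assuming the catalog has already been parsed into a plain mapping by whatever PDF library is in use; the function name is illustrative:

```python
from typing import Any, Mapping


def looks_tagged(catalog: Mapping[str, Any]) -> bool:
    """Return True when a document catalog claims to be tagged PDF.

    `catalog` is assumed to be the document catalog as a plain mapping
    with PDF names as keys. The check only shows that tagging is
    declared; it says nothing about the quality of the tags.
    """
    mark_info = catalog.get("/MarkInfo") or {}
    return bool(mark_info.get("/Marked")) and "/StructTreeRoot" in catalog
```

A positive result is a reason to try the tag-driven path first, not a guarantee that the structure is useful.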
Key Considerations For Data Extraction
Extraction: Key Considerations
• Content extraction means different things to different
audiences
• Know your audience and its goals
• Different goals are best met through different means
Extraction: Different Meanings
Let’s take a PDF that’s just one image of a scanned page:
• Does extracting the content mean returning the image?
• Does extracting the content mean OCRing the image and
returning the text?
If the PDF contains an image with text underneath – is the content
the image, the text, or both?
Know Your Audience’s Goals
Different audiences have different needs:
• Extraction for indexing or summarization typically requires a
pure text stream of paragraphs
• Extraction for loading contents into a database for machine
learning typically does not need appearance preservation
• Extraction for presentation on a different screen or medium
typically means content order should be preserved but the
appearance is expected to change
Different Goals, Different Means
Different goals mean different trade-offs:
• Indexing, machine learning, data mining – preservation of text
and reconstruction of semantics most important
• Reformatting for reflow or format conversion – balance between
text preservation and appearance preservation needed
• Reformatting for reliable viewing across devices – appearance
preservation most important, text preservation secondary;
semantic reconstruction usually not required
Use Cases
Use Cases for Content Extraction
• Conversion to HTML for viewing PDF without a PDF viewer
• Converting PDF into a reflowable HTML representation
• Extraction of PDF contents for machine understanding
Viewing PDF Without a PDF Viewer
PDF extraction and conversion revolves around visual appearance:
• Extract content and convert it into a one-to-one analogue in a
different fixed layout (HTML + SVG, raster image, print-out, etc.)
• Convert extracted content into different visual primitives
• Reliable viewing, but maintains disadvantages of PDF format
This is the simplest way to convert PDF content for human
reading – but it doesn’t extract the content into a form that is
useful for machines
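A fixed-layout conversion can be sketched as a direct mapping from placed text to SVG text elements. The sketch assumes fragments given as (text, x, y) in default PDF coordinates, with the origin at the bottom-left and y growing upward, so y is flipped for SVG, whose origin is at the top-left. The function name and input shape are illustrative:

```python
from html import escape


def page_to_svg(fragments: list[tuple[str, float, float]], width: float, height: float) -> str:
    """Render (text, x, y) fragments as fixed-position SVG text.

    Input y values are assumed to follow the default PDF convention
    (origin at the bottom-left, y growing upward), so they are flipped
    for SVG, whose origin is at the top-left.
    """
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {width:g} {height:g}">']
    for text, x, y in fragments:
        parts.append(f'<text x="{x:g}" y="{height - y:g}">{escape(text)}</text>')
    parts.append("</svg>")
    return "\n".join(parts)
```

The output keeps every fragment where the PDF put it, which is why this approach views reliably but gives machines nothing more to work with than the PDF did.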
Converting PDF Into Reflowable HTML
PDF extraction and conversion balances needs of humans and
machine understanding:
• Elements are analyzed in page context and turned back into text
flows, lists, tables, and other structured elements
• Elements that can’t be expressed in HTML are usually rendered
to allow proper viewing, at the loss of searchability
• Navigation elements – bookmarks, links – are converted into
HTML equivalents for easy browsing
• Pagination artifacts are discarded when possible
Resulting HTML is reflowable and gives good document reading
experience, but appearance typically changes somewhat to be
more “HTML-ish”
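Once structure has been recovered, emitting reflowable HTML is the easy part. A minimal sketch, assuming the analysis step produces blocks as (kind, payload) pairs: "p" with a string, "ul" with a list of item strings, or "table" with a list of rows of cell strings. This block model is purely illustrative:

```python
from html import escape


def blocks_to_html(blocks: list[tuple[str, object]]) -> str:
    """Serialize recovered structure as simple reflowable HTML."""
    out: list[str] = []
    for kind, payload in blocks:
        if kind == "p":
            out.append(f"<p>{escape(payload)}</p>")
        elif kind == "ul":
            items = "".join(f"<li>{escape(item)}</li>" for item in payload)
            out.append(f"<ul>{items}</ul>")
        elif kind == "table":
            rows = "".join(
                "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
                for row in payload
            )
            out.append(f"<table>{rows}</table>")
        else:
            raise ValueError(f"unknown block kind: {kind!r}")
    return "\n".join(out)
```

No positions survive into the output, so the browser is free to reflow the text; that is also why the result looks more "HTML-ish" than the source page.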
Extraction of PDF Contents For Machine Understanding
PDF extraction focused on text and structure:
• Elements are analyzed in page context and turned back
into text flows, lists, tables, and other structured elements
• Text elements that can’t be expressed in HTML are usually
left as text, sacrificing visual fidelity
• Navigation elements – bookmarks, links – are converted so
that automated processes can crawl these
• Pagination artifacts should be discarded when possible
About Datalogics PDF Alchemist
Datalogics PDF Alchemist
• Works on untagged PDFs – handles existing PDFs, does not
require workflow changes or regenerating/reconstructing source
PDFs
• Turns placed words in PDFs back into reflowable text
• Re-creates tables and lists from page content
• Removes pagination artifacts such as page numbers and running
headers
• Converts PDF into single-page HTML5 + CSS or into EPUB
packages
• Converts PDF forms into fixed-layout HTML forms for use in
mobile environments
Summary
Extracting Content from PDFs
Intelligently extracting content from PDF files requires:
• Seeing pages the way a human reads them
• Figuring out the logical structure of the pages
• Putting text back together into text flows
• Putting all these elements back together in the correct order
• Compensating intelligently for differences between PDF and
the chosen method of receiving content
Questions?
Matt Kuznicki
Chief Technical Officer
mattk@datalogics.com
LinkedIn: mattkuznicki
Datalogics Inc.
www.datalogics.com
Twitter: @DatalogicsInc
