OCRFeeder
Documents conversion on GNOME

Joaquim Rocha
jrocha@igalia.com

FOSDEM 2010

What is it?

Document Analysis and Optical Character Recognition for GNOME


                   Joaquim Rocha (Igalia) · OCRFeeder · FOSDEM 2010
Why is it?

Paper has a number of problems

No applications for GNU/Linux do a fair job

Paper problems:
   Security




          CC Photo by: http://www.flickr.com/photos/badwsky/
Paper problems:
 Preservation




   CC Photo by: http://www.flickr.com/photos/98469445@N00/
Paper problems:
Data processing




     CC Photo by: http://www.flickr.com/photos/hugovk/
Paper problems:
   Ecology




        CC Photo by: http://www.flickr.com/photos/pranavsingh/
No fair conversion apps for GNU/Linux

apart from OCR engines, but...



OCR != Document Conversion

(it only deals with chars)
(does not consider the layout)
(does not distinguish contents)


What you want is

Document Analysis and Recognition

(conversion of documents to an electronic format)
(first projects in the 1980s)

Where were we at?

* Some closed solutions
* Only for proprietary systems
* Various prices
* Still... arguable results

How?




How




Base concept:

1. Clip the contents
2. Classify them
2.1. They are graphics → Paste on document
2.2. They are text → Calculate letter size; paste on document

So many layouts...




          CC Photo by: http://www.flickr.com/photos/uber-tuber/
Layouts vary with the type of document

What works for detecting one won't work for others


OCRFeeder focuses on contents, not on layouts!




Key concept:

If a document image can be divided into windows valued 1 (content) or 0 (not content), then it is possible to group all the 1s and outline the contents

Sliding Window Algorithm:

1. An N×N pixel window runs through the document, top to bottom, left to right




Sliding Window Algorithm:

2. For every iteration, if there's a pixel inside the window which contrasts with the background, then the window gets a 1, otherwise it gets a 0
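
The window-marking steps can be sketched in Python as follows. This is a minimal illustration, not OCRFeeder's actual code: the image is assumed to be a list of rows of grayscale values, and the names `mark_windows`, `window_size`, `background` and `threshold` are hypothetical.

```python
def mark_windows(image, window_size, background=255, threshold=40):
    """Return a grid of 1s (content) and 0s (no content), one per window.

    `image` is assumed to be a list of rows of grayscale values (0-255).
    A pixel "contrasts" when it differs from `background` by more than
    `threshold`; both values are illustrative assumptions.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    grid = []
    for top in range(0, height, window_size):
        row = []
        for left in range(0, width, window_size):
            window_rows = image[top:top + window_size]
            # any() stops at the first contrasting pixel, so a window
            # that has content is not scanned in full.
            has_content = any(
                abs(pixel - background) > threshold
                for line in window_rows
                for pixel in line[left:left + window_size]
            )
            row.append(1 if has_content else 0)
        grid.append(row)
    return grid


# A 4x4 white image with one dark pixel in the top-left 2x2 window.
image = [
    [255, 0, 255, 255],
    [255, 255, 255, 255],
    [255, 255, 255, 255],
    [255, 255, 255, 255],
]
grid = mark_windows(image, window_size=2)
```

Here only the top-left window contains a contrasting pixel, so it is the only one marked with a 1.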

Sliding Window Algorithm:

It does not check all the pixels, so performance is better




Sliding Window Algorithm:

3. After all windows have a value assigned, the ones with the value 1 are grouped

Sliding Window Algorithm:

4. Every time a set of 1s is grouped, each window in the set is reassigned the value 0

(these groups are called blocks)

Sliding Window Algorithm:

5. When all windows have the value 0, the algorithm has reached the end
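
The grouping steps can be sketched as a flood fill over the grid of window values. This is only an illustration under stated assumptions: the slides do not say how neighbouring windows are found, so 4-neighbour adjacency is assumed here, and `group_windows` is a hypothetical name.

```python
def group_windows(grid):
    """Group adjacent 1-windows into blocks, zeroing them as they are grouped.

    Returns blocks as (left, top, right, bottom) in window coordinates,
    inclusive. 4-neighbour adjacency is an assumption.
    """
    grid = [row[:] for row in grid]  # work on a copy of the input
    blocks = []
    for y, row in enumerate(grid):
        for x, value in enumerate(row):
            if value != 1:
                continue
            # Collect every window connected to this one.
            left, top, right, bottom = x, y, x, y
            stack = [(x, y)]
            grid[y][x] = 0  # grouped windows are reassigned the value 0
            while stack:
                cx, cy = stack.pop()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                neighbours = ((cx + 1, cy), (cx - 1, cy),
                              (cx, cy + 1), (cx, cy - 1))
                for nx, ny in neighbours:
                    inside = 0 <= ny < len(grid) and 0 <= nx < len(grid[ny])
                    if inside and grid[ny][nx] == 1:
                        grid[ny][nx] = 0
                        stack.append((nx, ny))
            blocks.append((left, top, right, bottom))
    # Every window is now 0, so the scan is complete.
    return blocks


grid = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
blocks = group_windows(grid)
```

Each block is reduced to its bounding box, which is one simple way to "outline the contents" of a group.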


Block structure:




Joining Blocks:

Blocks are checked against each other and joined when appropriate

When no more blocks can be joined, the analysis part is finished
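
A possible shape for this joining loop is sketched below. The slides do not state when joining is "appropriate", so the criterion used here (two bounding boxes overlap or lie within `max_gap` of each other) is purely an assumption, as are the names `boxes_are_close` and `join_blocks`.

```python
def boxes_are_close(a, b, max_gap):
    """True if boxes a and b overlap or lie within max_gap of each other."""
    a_left, a_top, a_right, a_bottom = a
    b_left, b_top, b_right, b_bottom = b
    # A negative gap means the boxes overlap on that axis.
    horizontal_gap = max(a_left, b_left) - min(a_right, b_right)
    vertical_gap = max(a_top, b_top) - min(a_bottom, b_bottom)
    return horizontal_gap <= max_gap and vertical_gap <= max_gap


def join_blocks(blocks, max_gap=1):
    """Merge blocks repeatedly until no pair can be joined any more."""
    blocks = list(blocks)
    merged = True
    while merged:
        merged = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                if boxes_are_close(blocks[i], blocks[j], max_gap):
                    a, b = blocks[i], blocks[j]
                    blocks[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del blocks[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan with the updated list
    return blocks


blocks = [(0, 0, 2, 1), (3, 0, 5, 1), (0, 10, 5, 12)]
joined = join_blocks(blocks)
```

The first two boxes sit side by side and are merged; the third is too far below to be joined, so the loop stops with two blocks.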


Recognition:

System-wide OCR engines are used

Engines are configured from the GUI or XML files

Engine configuration:

<?xml version="1.0" encoding="UTF-8"?>
<engine>
   <name>Tesseract</name>
   <image_format>TIFF</image_format>
   <engine_path>/usr/bin/tesseract</engine_path>
   <arguments>$IMAGE $FILE; cat $FILE.txt; rm $FILE.txt</arguments>
</engine>
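
To illustrate how such a file might be consumed, the sketch below parses the configuration with Python's standard library and expands `$IMAGE` and `$FILE` into a command line. How OCRFeeder itself reads the file and runs the engine is not shown in the slides; `load_engine`, `build_command`, the example paths, and the idea of handing the result to a shell are all assumptions.

```python
import shlex
import xml.etree.ElementTree as ET
from string import Template

ENGINE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<engine>
   <name>Tesseract</name>
   <image_format>TIFF</image_format>
   <engine_path>/usr/bin/tesseract</engine_path>
   <arguments>$IMAGE $FILE; cat $FILE.txt; rm $FILE.txt</arguments>
</engine>
"""


def load_engine(xml_text):
    """Read the engine settings into a plain dictionary."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return {child.tag: (child.text or "").strip() for child in root}


def build_command(engine, image_path, output_base):
    """Expand $IMAGE and $FILE and prepend the engine path.

    Treating the result as one shell command line is an assumption.
    """
    arguments = Template(engine["arguments"]).substitute(
        IMAGE=shlex.quote(image_path), FILE=shlex.quote(output_base))
    return engine["engine_path"] + " " + arguments


engine = load_engine(ENGINE_XML)
command = build_command(engine, "/tmp/page.tiff", "/tmp/page")
```

With these example paths the command runs the engine on the image, prints the resulting `.txt` file, and then removes it, which is how the recognized text would reach the caller on standard output.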



Classification:


It is graphics if:

* Text is empty
* More than 50% of the chars are failure chars, punctuation or spaces
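
The rule above can be written as a short function. This is a sketch rather than OCRFeeder's implementation: which characters count as "failure chars" depends on the OCR engine, so the `failure_chars` parameter and its default (the Unicode replacement character) are assumptions, and whitespace-only text is treated as empty.

```python
import string


def is_graphics(text, failure_chars="\ufffd"):
    """Classify OCR output as graphics (True) or text (False)."""
    if not text.strip():
        return True  # nothing was recognized
    noise = set(string.punctuation) | set(failure_chars)
    noisy = sum(1 for char in text if char.isspace() or char in noise)
    # Graphics if more than half of the characters are noise.
    return noisy / len(text) > 0.5


examples = [is_graphics(""), is_graphics("Hello world"), is_graphics("#.. , ;a")]
```

In the last example seven of the eight characters are punctuation or spaces, so the content would be pasted as an image instead of text.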
Font Size Detection:

“Measures” the size of each text line in pixels

Although it results in different sizes...

Font Size Detection:

A value equal to or greater than the average is chosen

(results in values equal or close to the original font size)

Font Size Detection:

The font size is calculated in inches using the resolution (DPI)
(if there's no resolution info, assume 300 DPI)

The value is then divided by the DTP (DeskTop Publishing) point, 1/72 of an inch, to get the size in points
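
As a worked example of the font size calculation, the sketch below picks a line height and converts it from pixels to points. The slides only say that a value equal to or greater than the average is chosen; taking the smallest such value is an assumption, and `pick_line_height` and `pixels_to_points` are hypothetical names.

```python
def pick_line_height(line_heights_px):
    """Pick the smallest measured height that is >= the average.

    Choosing the smallest qualifying value is an assumption.
    """
    average = sum(line_heights_px) / len(line_heights_px)
    return min(h for h in line_heights_px if h >= average)


def pixels_to_points(height_px, dpi=None):
    """Convert a height in pixels to DTP points (1 point = 1/72 inch)."""
    dpi = dpi or 300  # fall back to 300 DPI when there is no resolution info
    # height_px / dpi is the height in inches; dividing that by 1/72 inch
    # is the same as multiplying by 72.
    return height_px * 72 / dpi


heights = [48, 50, 52, 50]  # measured text line heights, in pixels
font_size = pixels_to_points(pick_line_height(heights))
```

With an average of 50 pixels and the default 300 DPI, the chosen line is 50/300 of an inch tall, which corresponds to a 12 point font.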
Exportation to ODT:

Uses ODFPy

(abstracts ODF creation)
(just above XML)

User interaction:

Users can edit everything and review the algorithm's results

So, the UI can work in attended and unattended ways
The CLI only works in unattended mode

ABBYY FineReader test

Nuance OmniPage test




FineReader's results

OmniPage's results




Demo time!




Other features:

* PDF importation
* Unpaper preprocessor
* Font style editing
* Exportation to HTML
* Project saving/loading

Future:

* Integrate OCRopus as an alternative analysis backend
* More exportation formats: hOCR, txt, PDF
* Improved a11y
* Better integration with GNOME and other GNOME apps

GNOME:

Development moved to GNOME's infrastructure last month




Webpage:
http://live.gnome.org/OCRFeeder

git:
http://git.gnome.org/ocrfeeder

Bugzilla:
coming soon...

Thank you!

