By Joaquim Rocha.
OCRFeeder is a document layout analysis and optical character recognition system that I wrote for my Master's Thesis project.
As its website says, given a set of images it automatically outlines their contents, distinguishes graphics from text, and performs OCR on the latter. It can export to multiple formats, the main one being ODT.
I think this is currently the most complete and user-friendly OCR application for GNU/Linux. I wrote it to be used mainly with GNOME: it features a GUI written in PyGTK and follows, as far as I could, the GNOME Human Interface Guidelines.
I would like to present how the application works on the inside, for example the page segmentation algorithm I created for it. I think this would be of interest to the GNOME community and to attendees of the GNOME Dev room at FOSDEM.
8. No fair conversion apps for
GNU/Linux
apart from OCR engines, but...
Joaquim Rocha (Igalia) · OCRFeeder · FOSDEM 2010
9. OCR != Document Conversion
(it only deals with chars)
(does not consider the layout)
(does not distinguish contents)
10. what you want is
Document Analysis and
Recognition
(conversion of documents to an
electronic format)
(first projects in the 80s)
11. Where were we at?
* Some closed solutions
* Only for proprietary systems
* Various prices
* still... arguable results
14. Base concept:
1. Clip the contents
2. Classify them
2.1. They are graphics → Paste on
document
2.2. They are text → Calculate letter
size; paste on document
15. So many layouts...
CC Photo by: http://www.flickr.com/photos/uber-tuber/
16. Layouts vary with the type of
document
What works for detecting one won't
work for others
17. OCRFeeder focuses on contents, not
on layouts!
18. Key concept:
If a document image can be
divided into windows of 1 (content)
or 0 (not content),
then it is possible to group all the
1s and outline the contents
19. Sliding Window Algorithm:
1. An NxN pixel window runs through
the document top to bottom, left to
right
20. Sliding Window Algorithm:
2. For every iteration, if there's a
pixel inside the window which
contrasts with the background,
then the window gets a 1,
otherwise it gets a 0
22. Sliding Window Algorithm:
It does not check every pixel, so
performance is better
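Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not OCRFeeder's actual code: it assumes the image is a list of rows of grayscale values (0-255), and the names `window_values`, `background`, `tolerance` and `stride` are hypothetical. Sampling only every `stride`-th pixel is one assumed way of not checking all the pixels.

```python
def window_values(image, n, background=255, tolerance=40, stride=2):
    """Return a grid of 0/1 values, one per NxN window of `image`.

    `image` is a list of rows of grayscale values (0-255). Every
    parameter except the window size `n` is an assumption of this sketch.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    grid = []
    for top in range(0, height, n):          # top to bottom
        row = []
        for left in range(0, width, n):      # left to right
            value = 0
            # Sample every `stride`-th pixel instead of checking them all.
            for y in range(top, min(top + n, height), stride):
                for x in range(left, min(left + n, width), stride):
                    # A pixel "contrasts" if it differs enough from the background.
                    if abs(image[y][x] - background) > tolerance:
                        value = 1
                        break
                if value:
                    break
            row.append(value)
        grid.append(row)
    return grid
```

With a larger `stride` fewer pixels are read per window, at the cost of possibly missing small marks that fall between the sampled positions.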
23. Sliding Window Algorithm:
3. After all windows
have a value assigned,
the ones with the value 1 are
grouped
24. Sliding Window Algorithm:
4. Every time a set of 1s is
grouped, each of its windows is
reassigned the value 0
(these groups are called blocks)
25. Sliding Window Algorithm:
5. When all windows have
the value 0, the algorithm
has reached the end
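Steps 3 to 5 can be sketched as below. The slides do not say how the 1s are grouped, so this sketch assumes that windows sharing an edge belong to the same block and uses a simple flood fill; the function name `group_windows` and the bounding-box representation are hypothetical.

```python
def group_windows(grid):
    """Group adjacent 1-windows into blocks and return their bounding boxes.

    Each block is (left, top, right, bottom) in window coordinates,
    inclusive. `grid` is modified in place: every grouped window is
    reset to 0, so the scan is done when no 1s remain.
    """
    blocks = []
    for row in range(len(grid)):
        for col in range(len(grid[row])):
            if grid[row][col] != 1:
                continue
            # Flood fill from this window (assumed grouping rule:
            # windows that share an edge belong to the same block).
            grid[row][col] = 0
            pending = [(row, col)]
            left = right = col
            top = bottom = row
            while pending:
                r, c = pending.pop()
                left, right = min(left, c), max(right, c)
                top, bottom = min(top, r), max(bottom, r)
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
                    if inside and grid[nr][nc] == 1:
                        grid[nr][nc] = 0
                        pending.append((nr, nc))
            blocks.append((left, top, right, bottom))
    return blocks
```

For example, `group_windows([[1, 1, 0, 0], [0, 1, 0, 1], [0, 0, 0, 1]])` yields two blocks, `(0, 0, 1, 1)` and `(3, 1, 3, 2)`, and leaves the grid filled with 0s.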
27. Joining Blocks:
Blocks are checked against each other
and joined when appropriate
When no blocks can be joined the
analysis part is finished
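The joining pass might look like the sketch below. The slides do not state when joining is "appropriate", so this assumes two blocks are joined when their bounding boxes overlap or lie within `gap` units of each other; `join_blocks`, `_close` and `gap` are hypothetical names, and blocks are `(left, top, right, bottom)` tuples.

```python
def _close(a, b, gap):
    """True if boxes a and b overlap or are at most `gap` apart on both axes."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2] and
            a[1] - gap <= b[3] and b[1] - gap <= a[3])


def join_blocks(blocks, gap=0):
    """Repeatedly merge blocks that are close until no pair can be joined."""
    blocks = list(blocks)
    joined = True
    while joined:
        joined = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                a, b = blocks[i], blocks[j]
                if _close(a, b, gap):
                    # Replace the pair with the box that encloses both.
                    blocks[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del blocks[j]
                    joined = True
                    break
            if joined:
                break  # restart: the merged box may now reach other blocks
    return blocks
```

Restarting after each merge matters because a newly enlarged block can become close to a block it was previously too far from.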
30. Classification:
It is graphics if:
* Text is empty
* More than 50% of the chars are
failure chars, punctuation or spaces
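The classification rule above can be sketched as a small function applied to the OCR output of a block. Which characters count as "failure chars" depends on the OCR engine and is not given in the slides; the default below (the Unicode replacement character) is only a placeholder, and `is_graphics` is a hypothetical name.

```python
import string


def is_graphics(ocr_text, failure_chars="\ufffd"):
    """Return True if a block's OCR output suggests it is graphics, not text."""
    # Rule 1: the engine recognized nothing.
    if not ocr_text.strip():
        return True
    # Rule 2: more than 50% of the characters are failure chars,
    # punctuation or spaces.
    noise = set(string.punctuation) | set(failure_chars)
    noisy = sum(1 for ch in ocr_text if ch.isspace() or ch in noise)
    return noisy > len(ocr_text) / 2
```

Under these assumptions `is_graphics("Hello world")` is false, while `is_graphics("~ .. ,, x")` is true because nine of its ten characters are punctuation or spaces.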
31. Font Size Detection:
“Measures” in pixels the size of
each text line
Although it results in different
sizes...
32. Font Size Detection:
The value equal to or greater than the
average is chosen
(results in values equal to or close to
the original font size)
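One possible reading of this selection rule, as a sketch: among the measured line heights, take the smallest one that is not below the average. The slides do not say which value is picked when several qualify, so that tie-breaking choice and the name `pick_line_height` are assumptions.

```python
def pick_line_height(heights_px):
    """Pick a representative line height (in pixels) for a text block.

    Assumed rule: the smallest measured height that is equal to or
    greater than the average of all measured heights.
    """
    average = sum(heights_px) / len(heights_px)
    # The maximum is never below the average, so the candidates are never empty.
    return min(h for h in heights_px if h >= average)
```

For the measurements `[30, 31, 33, 29]` the average is 30.75, so the sketch picks 31.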
33. Font Size Detection:
The font size is calculated in inches
using the resolution (DPI)
(if there's no resolution info, assume
300 DPI)
The value in inches is then multiplied
by 72, the number of DTP (DeskTop
Publishing) points per inch
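As a worked sketch of this conversion: a height in pixels divided by the resolution gives inches, and there are 72 DTP points in an inch. The function and constant names are hypothetical; the 300 DPI fallback is the one stated on the slide.

```python
DEFAULT_DPI = 300        # fallback when the image has no resolution info
POINTS_PER_INCH = 72     # one DTP point is 1/72 of an inch


def pixels_to_points(height_px, dpi=None):
    """Convert a line height in pixels to a font size in DTP points."""
    if not dpi:
        dpi = DEFAULT_DPI
    # height_px / dpi is the height in inches; multiply by 72 for points.
    return height_px * POINTS_PER_INCH / dpi
```

For instance, a 50-pixel line in an image with no resolution info gives 50 * 72 / 300 = 12 points.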
35. User interaction:
Users can edit everything
and review the algorithm's results
So the UI can work in attended and
unattended modes
The CLI only works in unattended
mode
39. Other features:
* PDF import
* Unpaper preprocessor
* Font style editing
* Export to HTML
* Project saving/loading
40. Future:
* Integrate OCRopus as an
alternative analysis backend
* More export formats: hOCR,
txt, PDF
* Improved a11y
* Better integration with GNOME
and other GNOME apps
41. GNOME:
Development moved to GNOME's
infrastructure last month