Presentation of the proceedings paper "Hybrid Page Layout Analysis via Tab-Stop Detection" by Ray Smith, for the Page Segmentation Competition held at ICDAR 2009.
Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"
1. Presentation of
Hybrid Page Layout Analysis via Tab-Stop Detection
Ray Smith, Proc. ICDAR2009, Barcelona, Spain, 2009.
Javier de la Rosa {jdelaros at uwo dot ca}
CS 9883
2. Internal use only | 2 of 25 | Javier de la Rosa | Hybrid Page Layout Analysis via Tab-Stop Detection | CS 9883
Index.
1. Context and background.
2. Introduction.
3. Page layout via tab-stop detection.
4. Preprocessing.
5. Finding tab positions as line segments.
6. Finding the column layout.
7. Finding the regions.
8. Testing and results.
9. Conclusion and further work.
10. Criticism.
11. References.
1. Context and background.
• International Conference on Document Analysis and
Recognition [1].
• Page Segmentation competitions: 2001, 2003, 2005, 2007
and 2009 [2].
• Tesseract, the OCR from Google [3].
Eleventh International Conference on Document Analysis and Recognition (ICDAR 2011) <http://www.icdar2011.org/> [1]
A. Antonacopoulos, et al. ICDAR 2009 Page Segmentation Competition, Barcelona, Spain, 2009. <http://www.cse.salford.ac.uk/prima/ICDAR2009_pscomp/> [2]
The Tesseract OCR <http://code.google.com/p/tesseract-ocr/> [3]
2. Introduction.
Physical page layout analysis:
• Bottom-up [4].
• Top-down [5].
• Whitespaces [6].
Logical page layout analysis:
• Voronoi.
• Smearing.
• Etc.
M. Chen, X. Q. Ding, "Unified HMM-based Layout Analysis Framework and Algorithm," SCI CHINA Ser F, 46(6), Dec. 2003, pp401-408. [4]
G. Nagy, S.C. Seth, "Hierarchical Representation of Optically Scanned Documents," Proc. 7th Int. Conf. on Pattern Recognition, Montreal, Canada, 1984, pp347-349. [5]
T.M. Breuel, "Two Geometric Algorithms for Layout Analysis," Proc. of the 5th Int. Workshop on Document Analysis Systems V, Springer-Verlag 2002, pp188-199. [6]
3. Page layout via tab-stop detection.
• Regions bounded by tab-stops.
• Fixed x-positions.
• Vertical alignment.
Phases:
1. Preprocessing.
2. Bottom-up tab-stop detection.
3. Finding the column layout.
4. Set of typed regions.
4. Preprocessing
• Detection of vertical lines and image mask [7].
• Connected components (CCs) analysis.
• CC filtering by width w and height h:
– Small: h < 7 pixels (at 300 ppi) or h < h75 / 2
– Large: h > 2 h75 or w > 8 h75
– Medium: everything else.
Leptonica image processing and analysis library <http://www.leptonica.com> [7]
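The size filter above can be sketched in a few lines. This is a minimal illustration, not Tesseract's implementation: it assumes h75 is the 75th percentile of the CC heights (the slide does not define it), and the function name classify_cc and its parameters are hypothetical.

```python
def classify_cc(w, h, h75, min_h=7):
    """Classify a connected component by width w and height h (in pixels).

    h75 is assumed to be the 75th percentile of CC heights on the page;
    min_h is the absolute noise threshold quoted for 300 ppi scans.
    """
    if h < min_h or h < h75 / 2:
        return "small"   # likely noise, dots or punctuation
    if h > 2 * h75 or w > 8 * h75:
        return "large"   # likely a headline, rule or graphic
    return "medium"      # everything else: ordinary body text


if __name__ == "__main__":
    for w, h in [(10, 5), (15, 20), (15, 45), (200, 20)]:
        print((w, h), classify_cc(w, h, h75=20))
```

Only the medium CCs are then used as tab-stop candidates, which is why the thresholds matter for everything that follows.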
5. Finding tab positions as line segments. (1/3)
• Candidate tab-stop components:
– Each CC is a candidate tab-stop by default.
– Look for aligned neighbours.
– Mark each CC as left tab, right tab or
neither.
• Grouping candidate tabs:
– In lines and, if there are many, in groups.
– Least median of squares to fit the lines
(left or right).
– Refit lines to the page-mean direction.
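To make the line-fitting step concrete, here is a small sketch of a least-median-of-squares fit. It assumes each candidate tab CC is reduced to one (x, y) point and that a near-vertical tab line is written as x = a*y + b. It tries every pair of points rather than random samples, which is only practical for small groups. The function name lmeds_vertical_line is hypothetical and the code is not taken from Tesseract.

```python
from itertools import combinations
from statistics import median


def lmeds_vertical_line(points):
    """Fit x = a*y + b to (x, y) points by least median of squares.

    Every pair of points proposes a line; the line whose squared
    x-residuals have the smallest median wins, so up to about half
    of the points may be outliers without affecting the result.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if y1 == y2:
            continue  # a horizontal pair cannot define a near-vertical line
        a = (x2 - x1) / (y2 - y1)
        b = x1 - a * y1
        score = median((x - (a * y + b)) ** 2 for x, y in points)
        if best is None or score < best[0]:
            best = (score, a, b)
    if best is None:
        raise ValueError("need at least two points with different y")
    return best[1], best[2]


if __name__ == "__main__":
    # Four left edges aligned at x = 100 plus one stray component.
    pts = [(100, 0), (100, 10), (100, 20), (100, 30), (140, 40)]
    print(lmeds_vertical_line(pts))
```

Unlike an ordinary least-squares fit, the stray component at x = 140 does not pull the fitted line away from the aligned edges.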
5. Finding tab positions as line segments. (2/3)
• Tracking text lines to connect tab-stops:
– From one tab-stop to another.
– Associate tab-stops connected by text
lines.
– Discard tab-stops with no connections.
– Record the most frequently occurring
text-line widths.
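Recording the most frequent text-line widths amounts to building a histogram. The sketch below assumes widths are measured in pixels and grouped into fixed-size bins; the bin size, the number of widths kept and the function name common_widths are illustrative choices, not values from the paper.

```python
from collections import Counter


def common_widths(widths, bin_size=10, top=2):
    """Return the `top` most frequent text-line widths.

    Widths are rounded to the nearest multiple of bin_size so that
    lines differing by a few pixels count as the same width.
    """
    counts = Counter(round(w / bin_size) * bin_size for w in widths)
    return [w for w, _ in counts.most_common(top)]


if __name__ == "__main__":
    print(common_widths([498, 502, 501, 240, 243, 900]))
```

These widths are reused later, when deciding which column partitions count as good ones.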
5. Finding tab positions as line segments. (3/3)
• Cleaning up tab-stop ends:
– Make connected tab lines end at the same y
coordinate by moving the ends between the last
member CC and the first non-member CC.
• Reclassify CCs as “Text” or “Unknown”:
– A group of CCs of significant width forms a text
line.
• Create artificial CCs from the image mask.
6. Finding the column layout. (1/3)
• Scan CCs from left to right and top to bottom, gathering them into
Column Partitions (CPs).
• A CP may not cross a tab-stop line.
• Collections of CPs are stored in Column Partition Sets
(CPsets).
• Find the column layout → find an optimal set of CPsets that
best “explains” all the CPsets on the page.
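The rule that a CP may not cross a tab-stop line can be illustrated on a single text row. The sketch assumes each CC is given as a (left, right) pair of x coordinates and each tab-stop as a single x position; the function name split_row_into_partitions is hypothetical, and the real algorithm works on tab line segments rather than bare x values.

```python
def split_row_into_partitions(boxes, tab_xs):
    """Group the CCs of one text row into column partitions.

    boxes  : list of (left, right) x-extents of the CCs on the row
    tab_xs : x positions of the tab-stop lines crossing this row
    A new partition starts whenever a tab-stop falls in the gap
    between two neighbouring CCs.
    """
    partitions = []
    current = []
    for box in sorted(boxes):
        if current:
            gap_left, gap_right = current[-1][1], box[0]
            if any(gap_left <= t <= gap_right for t in tab_xs):
                partitions.append(current)
                current = []
        current.append(box)
    if current:
        partitions.append(current)
    return partitions


if __name__ == "__main__":
    row = [(0, 10), (12, 20), (50, 60), (62, 70)]
    print(split_row_into_partitions(row, tab_xs=[40]))
```

With a tab-stop at x = 40, the row splits into two partitions, one on each side of the column gap.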
6. Finding the column layout. (2/3)
• A good CP touches a tab line on both vertical edges.
• A good CP has a width close to a frequently occurring text-line width (recorded in section 5, 2/3).
• The coverage of a CPset is the total width of all the good CPs that it contains.
• CPset A is better than CPset B if A has greater coverage.
• What does “explain” mean? In short:
– CPset A explains CPset B if all of the following hold:
• B does not have more text than A.
• A has not split a column of common width.
• A does not have a different number of columns from B.
• A has not merged two columns of B.
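The coverage measure used to rank CPsets can be sketched as follows. The code assumes a CP is a (left, right) pair, that a CP is good if it satisfies either of the two criteria listed above, and that "touching" and "close" are decided by the hypothetical tolerances snap and tol. None of these names or values come from the paper.

```python
def is_good(cp, tab_xs, widths, snap=2, tol=0.1):
    """A CP is good if both edges lie on tab lines, or if its width is
    within a fraction tol of a frequently occurring text-line width."""
    left, right = cp
    touches_both = (any(abs(left - t) <= snap for t in tab_xs)
                    and any(abs(right - t) <= snap for t in tab_xs))
    width = right - left
    common = any(abs(width - w) <= tol * w for w in widths)
    return touches_both or common


def coverage(cpset, tab_xs, widths):
    """Total width of the good CPs in a CPset."""
    return sum(r - l for (l, r) in cpset if is_good((l, r), tab_xs, widths))


if __name__ == "__main__":
    tabs, widths = [0, 100, 120, 220], [100]
    cpset_a = [(0, 100), (120, 220)]   # two full columns
    cpset_b = [(0, 60), (120, 220)]    # a short line plus one full column
    print(coverage(cpset_a, tabs, widths), coverage(cpset_b, tabs, widths))
```

Here cpset_a has the greater coverage, so it is the better candidate for the column layout.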
6. Finding the column layout. (3/3)
• Form a list of candidate layouts from the set of CPsets on
the page.
• Order it with the best ones first.
• Eliminate duplicates with
the “A explains B” rules.
• Ignore image CPs.
• Improve the candidates
by adding new CPs.
7. Finding the regions. (1/3)
• Create flows of CPs:
– Choose the best matching upper and
lower partner.
– The list of partners is iteratively reduced to
zero or one.
– Different rules for image CPs and text
CPs.
– Each chain of CPs returned represents a
candidate region:
• Text is blue.
• Heading text is cyan.
• Heading image is magenta.
• Pull-out image is orange.
7. Finding the regions. (2/3)
• The rules to apply:
1. Type. If there are multiple types, text can only stay with its own (exact) type,
whereas an image can stay with any other image type.
2. Transitive partner shortcuts are broken. If A has 2 partners B and C, and also B
has C as a partner in the same direction, then delete C as a partner of A,
leaving a clean chain A-B-C. Also if A has a partner B, and B has a partner A in
the same direction, break the cycle.
3. (Text only) If A still has 2 partners B, C, chase B and C's partners to see which
has the longest chain. Delete from A the partner that has the shortest chain,
and convert the type of the shortest chain to pull-out.
4. (Image only) Choose the partner CP with the largest horizontal overlap.
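Rule 2 is the easiest of these to show in code. The sketch below assumes the partner relation in one direction is stored as a dictionary mapping each CP name to the set of its partners; the function name break_transitive_shortcuts is hypothetical, and cycle breaking and the type rules are left out.

```python
def break_transitive_shortcuts(partners):
    """Remove shortcut links so that A-B-C chains stay clean.

    partners maps each CP to the set of its partners in one direction.
    If A has partners B and C, and B also has C as a partner, then C is
    dropped from A, leaving the chain A -> B -> C.
    """
    pruned = {a: set(ps) for a, ps in partners.items()}
    for a, own in pruned.items():
        for b in list(own):
            # Anything reachable through partner b is not a direct partner of a.
            for c in partners.get(b, ()):
                own.discard(c)
    return pruned


if __name__ == "__main__":
    lower = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
    print(break_transitive_shortcuts(lower))
```

After this pass A keeps only B as a partner, so the later rules have fewer ambiguous cases to resolve.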
7. Finding the regions. (3/3)
• Determine the reading order:
1. Flowing blocks are ordered by y position within a column.
2. Pull-out blocks are ordered by y position in an imaginary column
between the real columns that they touch.
3. A heading spans multiple columns and follows anything that is
above it in the columns spanned, or between them.
4. A change in column layout works just like a heading.
5. Between headings, the content of columns is ordered from
left to right.
• Find the polygon boundary for each region:
– Polygons are isothetic.
– Polygon edges are chosen to minimize the number of
vertices.
– All CPs are contained within their region.
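A simplified version of the reading-order rules can be sketched as below. It covers only rules 1, 3 and 5, assumes every heading spans all columns, ignores pull-outs and layout changes, and takes y to increase down the page. The block fields (name, kind, col, y) and the function name reading_order are hypothetical.

```python
def reading_order(blocks):
    """Order blocks for reading under simplified rules.

    Each block is a dict with keys: name, kind ("heading" or "flow"),
    col (column index, left to right) and y (top coordinate).
    Headings are assumed to span every column.
    """
    headings = sorted((b for b in blocks if b["kind"] == "heading"),
                      key=lambda b: b["y"])
    remaining = [b for b in blocks if b["kind"] != "heading"]
    order = []
    for h in headings:
        # Everything above the heading is read first: columns left to
        # right, and top to bottom within each column.
        above = [b for b in remaining if b["y"] < h["y"]]
        order += sorted(above, key=lambda b: (b["col"], b["y"]))
        order.append(h)
        remaining = [b for b in remaining if b["y"] >= h["y"]]
    order += sorted(remaining, key=lambda b: (b["col"], b["y"]))
    return [b["name"] for b in order]


if __name__ == "__main__":
    page = [
        {"name": "title", "kind": "heading", "col": 0, "y": 0},
        {"name": "a", "kind": "flow", "col": 0, "y": 10},
        {"name": "b", "kind": "flow", "col": 0, "y": 50},
        {"name": "c", "kind": "flow", "col": 1, "y": 10},
        {"name": "mid", "kind": "heading", "col": 0, "y": 100},
        {"name": "d", "kind": "flow", "col": 0, "y": 110},
        {"name": "e", "kind": "flow", "col": 1, "y": 110},
    ]
    print(reading_order(page))
```

The headings act as barriers: both columns above a heading are read completely before the heading itself and anything below it.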
8. Testing and results. (1/2)
• Algorithm implemented in C++.
• Part of Tesseract Open Source OCR system [3].
• Processes about one 8-megapixel image per second on a 3.4 GHz Pentium 4.
8. Testing and results. (2/2)
[Figure: four bar charts, one per metric (PRImA Metric, F-Measure, Recall, Precision), comparing the methods 2007-Besus, 2007-TH1, 2007-TH2 and Tesseract on the measures Noise, Sep, Text, Image and Overall.]
ICDAR 2007 set [2, 8]
A. Antonacopoulos, et al. "ICDAR2007 Page Segmentation Competition," Proc. 9th Int. Conf. on Doc. Analysis and Recognition, IEEE, Curitiba, Brazil, Sep 2007, pp1279-1283. [8]
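For readers unfamiliar with three of the metrics in the charts, the standard definitions of precision, recall and F-measure are sketched below. This is the textbook formulation from counts of correct, spurious and missed detections; how the competition matches detected regions to ground truth is not reproduced here, and the function name precision_recall_f is hypothetical.

```python
def precision_recall_f(true_pos, false_pos, false_neg):
    """Return (precision, recall, F-measure) from detection counts.

    precision = TP / (TP + FP): how much of what was detected is correct
    recall    = TP / (TP + FN): how much of the ground truth was found
    F-measure = harmonic mean of precision and recall
    """
    detected = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / detected if detected else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    print(precision_recall_f(true_pos=8, false_pos=2, false_neg=2))
```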
9. Conclusion and further work
• Tab-stops make an interesting and useful alternative to
white rectangles.
• It enables page layout analysis to easily handle the
complex non-rectangular layouts of modern magazines.
• Table detection will be added in the future.
10. Criticism. (1/4)
• The idea is entirely new and works reasonably well, but:
• No references.
• No formulas.
• No algorithms.
• No mathematical justification.
• Excess text and literature.
• The process is too long and, in many places, lacks
justification.
10. Criticism. (2/4)
• An example:
– Preprocessing: Small CCs: h < 7 (@300ppi) ...
• Why 7?
• Does it only work at 300ppi?
• Only on magazine paper (10.5” x 78.5”)?
10. Criticism. (3/4)
• More:
– Reclassify CCs as “Text” or “Unknown”: A group of CCs of
significant width forms a text line.
• What's a “significant width”?
– Find the polygon boundary for each region: Polygon
edges are chosen to minimize the number of
vertices.
• What's the algorithm or reference to do this?
10. Criticism. (4/4)
ICDAR 2009 Results [2]
11. References. (1/2)
1. Eleventh International Conference on Document Analysis and Recognition (ICDAR 2011)
<http://www.icdar2011.org/>
2. A. Antonacopoulos, et al. ICDAR 2009 Page Segmentation Competition, Barcelona,
Spain, 2009. <http://www.cse.salford.ac.uk/prima/ICDAR2009_pscomp/>
3. The Tesseract OCR <http://code.google.com/p/tesseract-ocr/>
4. M. Chen, X. Q. Ding, "Unified HMM-based Layout Analysis Framework and Algorithm,"
SCI CHINA Ser F, 46(6), Dec. 2003, pp401-408.
11. References. (2/2)
5. G. Nagy, S.C. Seth, "Hierarchical Representation of Optically Scanned Documents," Proc. 7th Int. Conf.
on Pattern Recognition, Montreal, Canada, 1984, pp347-349.
6. T. M. Breuel, "Two Geometric Algorithms for Layout Analysis," Proc. of the 5th Int. Workshop on
Document Analysis Systems V, Springer-Verlag 2002, pp188-199.
7. Leptonica image processing and analysis library
<http://www.leptonica.com>
8. A. Antonacopoulos, et al. "ICDAR2007 Page Segmentation Competition," Proc. 9th Int. Conf. on Doc.
Analysis and Recognition, IEEE, Curitiba, Brazil, Sep 2007, pp1279-1283.
Questions?
Thank you