Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
- 1. © 2019 The MITRE Corporation. All rights reserved.
Apache Tika
Tim Allison
tallison@apache.org, @_tallison
April 24, 2019
Haystack Conference
Approved for Public Release; Distribution Unlimited. Case Number 18-3138-6
- 2.
Overview
▪ What is Tika?
▪ tika-eval
▪ Running Tika safely
▪ Coming out in 1.21 and beyond
- 3.
Text/Metadata Extraction
- 4.
Things Can Happen
▪ Tired:
– Exceptions
– Unsupported file formats
– Encrypted files
– Garbled text
– Missing text
▪ Wired:
– Out-of-memory errors (OOM)
– Segmentation faults
– Infinite loops
– Multithreaded garbage collector pegging all CPU resources
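The "Wired" failures above cannot be caught with an ordinary try/except inside the parsing process, which is why parsing is often pushed into a child process that can be killed. The following is a minimal, generic sketch of that idea, not Tika's own mechanism: `run_isolated` and its return labels are hypothetical names, and the commands in the usage lines are stand-ins for a real parser invocation. A memory cap would be set on the child itself (for a JVM child, for example, with the `-Xmx` option).

```python
import subprocess
import sys


def run_isolated(cmd, timeout_s):
    """Run a (hypothetical) parser command in a child process.

    Returns a (status, output) tuple where status is "ok", "timeout",
    or "crash". The parent survives whatever happens to the child.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so an
        # infinite loop in the parser cannot hang the caller.
        return ("timeout", "")
    if proc.returncode != 0:
        # Non-zero exit: an error exit code, or on POSIX a negative
        # value when the child was killed by a signal (e.g. a seg fault).
        return ("crash", proc.stderr)
    return ("ok", proc.stdout)


if __name__ == "__main__":
    # Stand-in commands; a real setup would launch the parser here.
    print(run_isolated([sys.executable, "-c", "print('extracted text')"], 10))
    print(run_isolated([sys.executable, "-c", "while True: pass"], 1))
```

Recording the status per file, rather than only logging it, is what later makes it possible to count failures by file type.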
- 6.
Upgrade from PDFBox 1.8.6 to 1.8.7
- 9.
Soap Box
If your search system can’t tell the difference between those two…
You don’t have a search system.
👍 You’ve got a neat little demo! 👍
- 11.
tika-eval
▪ Profile individual runs
▪ Compare two runs
▪ Exceptions by mime
▪ Out of vocabulary (OOV) statistics
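As a rough illustration of the out-of-vocabulary idea: the share of extracted tokens that do not appear in a list of known words tends to jump when text comes out garbled. The slides do not say how tika-eval computes its OOV figure, so the sketch below is one plausible definition only; `COMMON_WORDS` is a tiny hypothetical word list and `oov_rate` is an illustrative name.

```python
import re

# Hypothetical, tiny word list for illustration; a realistic list
# would hold thousands of common words per language.
COMMON_WORDS = {"the", "of", "and", "to", "in", "is", "a", "for", "search", "text"}


def oov_rate(text, vocabulary=COMMON_WORDS):
    """Fraction of alphabetic tokens not found in the vocabulary.

    Returns None when the text has no alphabetic tokens at all,
    which is itself a useful signal (missing text).
    """
    # Runs of Unicode letters only; digits and underscores are excluded.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return None
    misses = sum(1 for token in tokens if token not in vocabulary)
    return misses / len(tokens)


if __name__ == "__main__":
    print(oov_rate("the text of the search"))  # every token is in the list
    print(oov_rate("xq zzk the"))              # mostly unknown tokens
```

A high rate on its own proves little (the file may simply be in another language), so the number is most useful when compared across two runs over the same file.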
- 12.
tika-eval: Eating our own dog food
▪ 3 million files (~1 TB) from Common Crawl and govdocs1, hosted on a public virtual machine provided by Rackspace
▪ Code to profile a single run or compare two runs before release
▪ Evaluation methodology co-developed with, and now co-run by, open source colleagues around the world on the MSOffice parser project and the PDF parser project
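Comparing two runs before a release can be pictured as a per-file diff of extraction results. The sketch below is an assumption-laden toy, not tika-eval's code: each run is modelled as a dict from file name to extracted text, with `None` standing for a parse exception, and `compare_runs`, the report keys, and `drop_threshold` are hypothetical names.

```python
def compare_runs(run_a, run_b, drop_threshold=0.5):
    """Compare two extraction runs over the same files.

    run_a, run_b: dict of file name -> extracted text, or None if the
    parse threw an exception. Only files present in both runs are compared.
    """
    report = {"new_exceptions": [], "fixed_exceptions": [], "big_text_drops": []}
    for name in sorted(run_a.keys() & run_b.keys()):
        before, after = run_a[name], run_b[name]
        if before is not None and after is None:
            report["new_exceptions"].append(name)
        elif before is None and after is not None:
            report["fixed_exceptions"].append(name)
        elif before is not None and after is not None:
            n_before = len(before.split())
            n_after = len(after.split())
            # Flag files whose token count fell below the threshold share.
            if n_before > 0 and n_after < n_before * drop_threshold:
                report["big_text_drops"].append(name)
    return report


if __name__ == "__main__":
    old = {"a.pdf": "one two three four", "b.docx": "hello world", "c.pdf": None}
    new = {"a.pdf": "one", "b.docx": None, "c.pdf": "now works"}
    print(compare_runs(old, new))
```

Whitespace token counts are a crude proxy; the point is only that regressions show up as per-file differences that can be listed and inspected before shipping.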
- 13.
Tika 1.21 and beyond
▪ Tika 1.21
– CSV/TSV detector and parser (Apache commons-csv)
– Improved detection and parsing of zip-based files (.docx, .pptx, .xlsx)
▪ Beyond
– Modularize tika-eval and include stats within the extract, for scalability and for aggregating stats within Solr/Elastic
– Increase coverage and speed of zip-based file detection; can we move entirely to streaming detection?
– Improve language coverage and the language identification component within tika-eval
▪ Help!
– What do you need?
– How can you help us help you?