MIME Magic with Apache Tika. Jukka Zitting, Tika committer and mentor.
Agenda: The Problem, The Solution, The Project, The Client
The Problem: PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc.; Lucene index
It's even worse! Licensing/Patents, Dependencies, Metadata extraction, Structured content, Encryption/Compression, Package formats, Streaming/Performance, Processing of digital media
The Solution: Technical
The Solution: Legal / Social
Project Status
Current Features
Project Statistics
Tika Parser API
Example: Text extraction
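A minimal sketch of the text-extraction pattern, assuming the general shape of Tika's parser interface: the parser reads an input stream, fills in metadata, and reports the document content as XHTML SAX events to a `ContentHandler`. To stay runnable with only the JDK, `SketchParser`, `PlainTextParser`, and `TextCollector` below are hypothetical stand-ins, not Tika classes; in Tika itself the corresponding roles are played by `Parser` (or `AutoDetectParser`) and a handler such as `BodyContentHandler`.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class TextExtractionSketch {

    /** Hypothetical stand-in for a parser interface: stream in, SAX events and metadata out. */
    interface SketchParser {
        void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
                throws IOException, SAXException;
    }

    /** Toy parser for UTF-8 plain text; reports each line as an XHTML <p> element. */
    static class PlainTextParser implements SketchParser {
        private static final String XHTML = "http://www.w3.org/1999/xhtml";

        @Override
        public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
                throws IOException, SAXException {
            metadata.put("Content-Type", "text/plain");
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
            AttributesImpl noAttributes = new AttributesImpl();
            handler.startDocument();
            handler.startElement(XHTML, "body", "body", noAttributes);
            String line;
            while ((line = reader.readLine()) != null) {
                handler.startElement(XHTML, "p", "p", noAttributes);
                handler.characters(line.toCharArray(), 0, line.length());
                handler.endElement(XHTML, "p", "p");
            }
            handler.endElement(XHTML, "body", "body");
            handler.endDocument();
        }
    }

    /** Collects character data and ignores markup; ends each paragraph with a newline. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("p".equals(localName)) {
                text.append('\n');
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Runs the toy parser over the stream and returns the collected plain text. */
    static String extractText(InputStream stream) throws IOException, SAXException {
        TextCollector collector = new TextCollector();
        new PlainTextParser().parse(stream, collector, new HashMap<>());
        return collector.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] document = "MIME Magic\nwith Apache Tika".getBytes(StandardCharsets.UTF_8);
        try (InputStream stream = new ByteArrayInputStream(document)) {
            System.out.print(extractText(stream));
        }
    }
}
```

Because the parser only emits SAX events, the caller decides what to do with them: collect plain text as here, serialize the XHTML, or feed an indexer. The same handler works unchanged with any parser that follows the same contract.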
Demo: Tika GUI
Thank You!
