3. Hydra brings structure
What is unstructured data?
• A linguistic excuse?
News articles
Plain text that contains invaluable metadata for search, such as:
• Title
• Author byline
• Lead paragraph
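As a rough illustration of how such fields can be recovered from plain text, here is a minimal Python sketch. It is not Hydra code. The layout it assumes (title on the first line, an optional single-line "By <name>" block, then paragraphs separated by blank lines) and the function name `extract_article_metadata` are hypothetical.

```python
import re

# Hypothetical byline convention: a single-line block of the form "By <name>".
BYLINE = re.compile(r"^by\s+(?P<author>.+)$", re.IGNORECASE)


def extract_article_metadata(text):
    """Split a plain-text article into title, author, and lead paragraph."""
    # Blocks are runs of text separated by one or more blank lines.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]
    metadata = {"title": None, "author": None, "lead": None}
    if not blocks:
        return metadata

    # Assumption: the first line of the first block is the title.
    metadata["title"] = blocks[0].splitlines()[0].strip()

    rest = blocks[1:]
    if rest:
        match = BYLINE.match(rest[0])
        if match:
            metadata["author"] = match.group("author").strip()
            rest = rest[1:]

    # Assumption: the first remaining block is the lead paragraph.
    if rest:
        metadata["lead"] = " ".join(rest[0].split())
    return metadata
```

The byline rule is deliberately naive: any one-line paragraph that starts with "By" would be read as an author, so real articles would need a stricter pattern.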
4. Hydra is about your data
• Enrich your documents with metadata, to power your search
• Language detection
• Sentiment analysis
• Headline extraction
• Regular expression matching and extraction
• Filter out unwanted documents
• Collect statistics
• Export to Staging environments
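The enrichment, filtering, and statistics steps listed on this slide can be sketched as small functions applied to each document in turn. This is an illustrative Python sketch only, not Hydra's actual API. The stage names, the dict-based document, and the convention that a stage returns `None` to drop a document are all assumptions.

```python
import re
from collections import Counter

# Simplified e-mail pattern, good enough for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_emails(doc):
    """Enrichment stage: add every e-mail address found in the body."""
    doc["emails"] = EMAIL.findall(doc.get("body", ""))
    return doc


def drop_short(doc, min_words=3):
    """Filter stage: return None to discard documents that are too short."""
    if len(doc.get("body", "").split()) < min_words:
        return None
    return doc


def run_stages(docs, stages):
    """Pass each document through the stages in order and collect statistics."""
    stats = Counter()
    kept = []
    for doc in docs:
        stats["seen"] += 1
        current = dict(doc)  # work on a copy so the input is left untouched
        for stage in stages:
            current = stage(current)
            if current is None:  # a filter stage rejected the document
                stats["filtered"] += 1
                break
        else:
            stats["kept"] += 1
            kept.append(current)
    return kept, stats
```

Calling `run_stages(docs, [drop_short, extract_emails])` returns the surviving, enriched documents together with counts of how many were seen, filtered, and kept.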
8. Hydra Design Objectives
Scalability
• Possible to connect any number of processing machines
Fault tolerance
• Failure of a stage affects only a single document
• Failures can be detected automatically
Robustness
• Stages and nodes are completely independent (no domino effect)
Development ease
• Allows test-driven pipeline development
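To make the fault tolerance objective concrete, the following Python sketch shows one way a stage failure can be confined to a single document and recorded for automatic detection. It is an assumption-laden illustration, not how Hydra is implemented. `process_all` and `word_count` are hypothetical names.

```python
def process_all(docs, stages):
    """Run every document through the stages, isolating failures per document."""
    processed, failed = [], []
    for doc in docs:
        current = dict(doc)  # copy, so a half-processed document is never reused
        try:
            for stage in stages:
                current = stage(current)
        except Exception as exc:
            # The failure is recorded and only this document is skipped;
            # the remaining documents are still processed.
            failed.append(
                {"id": doc.get("id"), "stage": stage.__name__, "error": repr(exc)}
            )
            continue
        processed.append(current)
    return processed, failed


def word_count(doc):
    """Example stage: fails with KeyError if the document has no body."""
    doc["word_count"] = len(doc["body"].split())
    return doc
```

Because each stage is a plain function over one document, it can be unit-tested in isolation, which is the kind of test-driven pipeline development the slide refers to.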
9. What about Hadoop and Big Data?
Use cases for document enrichment
• PageRank
• Analytics
Hadoop & Map/Reduce advantages
• Huge scalability
• Ability to work on entire document set at once
Hadoop & Map/Reduce drawbacks
• Batch processing
• Time-to-index
10. Hydra integrated with Hadoop
Blue – First round of indexing only
Red – Second round of indexing
Purple – All documents
11. Hydra in summary
Hydra
• can chew through almost anything
• has many heads
• regenerates
• scales
12. Hydra is Open Source
• Other committers
• The role of Findwise
For more information:
• http://www.findwise.com/hydra
• http://findwise.github.com/Hydra
• Email: joel.westberg@findwise.com