Presented at Lucene Revolution, 7-8 May in Boston and Berlin Buzzwords 4-5 June, 2012.
When working with free text search, the quality of the data in the index is a key factor on the quality of the results delivered and has a major impact on the information consumption experience. Hydra is designed to give the search solution the tools necessary to modify the data that is to be indexed in an efficient and flexible way. Providing a scalable and efficient pipeline which the documents pass through before being indexed into the search engine does this.
2. About Findwise
• Founded in 2005
• Offices in Sweden, Denmark,
Norway and Poland
• 80 employees (April 2012)
• Our objective is to be a leading provider of Findability solutions utilising
the full potential of search technology to create customer business value
3.
4. Technology independent
Creating search-driven Findability solutions based on market-leading
commercial and open source search technology platforms:
Autonomy IDOL
Microsoft (SharePoint and FAST Search products)
Google GSA
IBM ICA/OmniFind
LucidWorks
Apache Lucene/Solr
Elastic Search
and more…
6. Connecting source to search
Garbage in, garbage out. But what about unstructured data in?
• Flat data is richer than it appears
• Don’t discard information too soon!
The unstructured structured data paradox
Example: News articles
Plain text that contains invaluable metadata for search, such as:
• Title
• Author byline
• Lead paragraph
7. Enrichment and structuring possibilities
• Enrich your documents with metadata, to power your search
• Language detection
• Sentiment analysis
• Headline extraction
• Regular expression matching and extraction
• Filter out unwanted documents
• Collect statistics
• Export to Staging environments
11. Main Design Objectives
Scalability
• Horizontally scalable central repository
• Independent processing nodes
Failiure tolerant
• Failiure of a stage affects only a single document
• Failiure of a node affects at most n documents
• Failiures can be automaticly detected
Robustness
• Independent stages
Development ease
• Debug stages from IDE against actual data
• Allow test driven pipeline development
13. Hadoop/Big Data integration
Usecases for document enrichment
• Pagerank
• Analytics
Hadoop & Map/Reduce advantages
• Huge scalability
• Ability to work on entire document set at once
Hadoop & Map/Reduce drawbacks
• Batch processing
• Time-to-index
14. Hadoop/Big Data integration
Blue – First round of indexing only
Red – Second round of indexing
Purple – All documents
16. Open Source initiative
• Other committers
• The role of Findwise
For more information:
• http://www.findwise.com/hydra
• http://findwise.github.com/Hydra
• Email: joel.westberg@findwise.com
18. Thank Joel Westberg
joel.westberg@findwise.com
you!
Notas del editor
Findwise is a consultingcompany, focused on deliveringFindability solutions toourcustomers.
That’s where we come from, now let’s talk about whatLet’s start off with a view of how Apache Solr in the middle, Manifold CF on the far left
In examples: “Post the XML file to the DataImportHandler, and now you’ve indexed wikipedia!”In reality: “Here’s a jumbled heap of files, and in these documents when it says “product id: XYZ”, that’s the most important thing to index”
Today we use OpenPipeline.One stage might extract references from the text and store them in separate fieldsOne might extract entities from the textOne might be parsing the pdf itself.
Also forking is a problem etcetc
Core is the framework itself, e.g. “java –jar hydra-jar-with-dependencies.jar” to launch Hydra.
Light blue – all documents prior to being exported to hbasePurple – All documentsRed – all documents that have been reinserted from hbase