Adding Search to the Hadoop Ecosystem

Finding a needle in a stack of needles - adding Search to the Hadoop Ecosystem
Patrick Hunt (@phunt)
Bay Area Search Meetup, April 2014

Agenda
• Big Data and Search – setting the stage
• Cloudera Search’s Architecture
• Apache Lucene/Solr
• Apache Flume
• Apache HBase
• Apache MapReduce
• Apache Sentry
• Near Real Time and Batch Use Cases
• Conclusion and Q&A

Why Search?
An Integrated Framework on Apache Hadoop
• One pool of data
• One security framework
• One set of system resources
• One management interface

Search Simplifies Interaction
• User goals: explore, navigate, correlate
• Experts know MapReduce
• Savvy people know SQL
• Everyone knows Search!

What is Cloudera Search?
• Full-text, interactive search and faceted navigation
• Batch, near real-time, and on-demand indexing
• Apache Solr integrated with CDH
• Established, mature search with vibrant community
• Separate runtime like MapReduce, Impala
• Incorporated as part of the Apache Hadoop ecosystem
• Open Source
• 100% Apache, 100% Solr
• Standard Solr APIs
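
Because the slide stresses standard Solr APIs, a full-text query with faceted navigation can be sketched as a plain HTTP request to Solr's `/select` handler. The host, port, collection name (`logs`), and field names (`host`, `severity`) below are assumptions for illustration; the parameters `q`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, text, facet_fields, rows=10):
    """Build a URL for Solr's standard /select request handler."""
    params = [
        ("q", text),          # full-text query string
        ("rows", str(rows)),  # number of hits to return
        ("wt", "json"),       # ask for a JSON response
        ("facet", "true"),    # turn on faceting
    ]
    # One facet.field parameter per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/{collection}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical endpoint and collection; adjust to your deployment.
    print(build_select_url("http://localhost:8983/solr", "logs",
                           "error timeout", ["host", "severity"]))
```

Any HTTP client can issue the resulting URL; the same request works against a single Solr server or a SolrCloud collection.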

Challenges
• Scalable/Reliable Index Storage
• Near Real Time (NRT) indexing
• Scalable Batch Indexing
• Usability

Apache Lucene/Solr
• Lucene - full text search library
• Solr – search service on Lucene
• SolrCloud – distributed search
• We are using version 4 (4.4 currently)

Integrate Solr/Lucene with HDFS
• Lucene Directory Abstraction
• Implemented HDFSDirectory using HDFS client library
• Read/Write index files directly to HDFS
• Solr DirectoryFactory Abstraction
• HDFSDirectoryFactory plugs HDFSDirectory into Solr
• Configuration – Solr and HDFS
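
The idea behind the Directory and DirectoryFactory abstractions can be sketched in a few lines: index-writing code talks only to a small file interface, and a factory chosen by configuration decides where the bytes land. This is a conceptual Python sketch, not Lucene's or Solr's Java API; `FakeDfsDirectory` uses a dict as a stand-in for a distributed file system, and all class and function names here are hypothetical.

```python
from abc import ABC, abstractmethod
import os


class Directory(ABC):
    """Storage abstraction: index code only sees named files."""

    @abstractmethod
    def write_file(self, name, data): ...

    @abstractmethod
    def read_file(self, name): ...

    @abstractmethod
    def list_all(self): ...


class LocalDirectory(Directory):
    """Stores index files on the local file system."""

    def __init__(self, path):
        self.path = path
        os.makedirs(path, exist_ok=True)

    def write_file(self, name, data):
        with open(os.path.join(self.path, name), "wb") as f:
            f.write(data)

    def read_file(self, name):
        with open(os.path.join(self.path, name), "rb") as f:
            return f.read()

    def list_all(self):
        return sorted(os.listdir(self.path))


class FakeDfsDirectory(Directory):
    """Stand-in for an HDFS-backed directory; a dict plays the remote store."""

    def __init__(self, store, path):
        self.store = store
        self.path = path.rstrip("/")

    def write_file(self, name, data):
        self.store[f"{self.path}/{name}"] = bytes(data)

    def read_file(self, name):
        return self.store[f"{self.path}/{name}"]

    def list_all(self):
        prefix = self.path + "/"
        return sorted(k[len(prefix):] for k in self.store if k.startswith(prefix))


def directory_factory(kind, path, store=None):
    """Pick the storage backend from configuration, as a factory plug-in would."""
    if kind == "local":
        return LocalDirectory(path)
    if kind == "dfs":
        return FakeDfsDirectory({} if store is None else store, path)
    raise ValueError(f"unknown directory kind: {kind}")


def write_segment(directory, segment_name, postings):
    """Index-writing code: unaware of which backend it is writing to."""
    payload = repr(sorted(postings.items())).encode("utf-8")
    directory.write_file(segment_name, payload)
```

Swapping `"local"` for `"dfs"` in the factory call changes where the segment is stored without touching `write_segment`, which is the property the slide relies on.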

Cloudera Upstream Contributions - Solr
• SOLR-3911 - Directory/DirectoryFactory now first class
• Solr Replication now uses Directory abstraction
• Solr Admin UI no longer assumes local directory access
• SOLR-4916 – support for reading/writing Solr index files and transaction log files to/from HDFS
• HDFSDirectoryFactory/HDFSDirectory implementation
• SOLR-4655 - The Overseer should assign node names by default.
• SOLR-3706 - Ship setup to log with log4j
• SOLR-4494 - Clean up and polish Collections API
• SOLR-4718 - Improvements to configurability
• Configuration now entirely through ZooKeeper (optional)
• Many more improvements/cleanup/hardening/…

Distributed Search on Hadoop
[Diagram: Hue UI, custom UIs, and custom apps send queries to SolrCloud (several Solr servers). Flume, MR, and HBase feed the index. The Hadoop cluster, including HDFS, sits underneath.]

Near Real Time Indexing with Flume
Solr and Flume
• Data ingest at scale
• Flexible extraction and mapping
• Indexing at data ingest
[Diagram: log files and other sources feed Flume agents, each paired with an indexer; data also lands in HDFS.]

Apache Flume - MorphlineSolrSink
• Flume – reliable/scalable log collection
• Created a Flume “Sink” for indexing events to Solr
• Integrates Cloudera Morphlines (ETL framework)
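
The sink-plus-ETL idea can be sketched as a chain of small transformation steps applied to each event before it is handed to the index. This is a conceptual Python sketch, not the Flume or Morphlines API: the step functions, the `IndexingSink` class, the log line format, and the in-memory `indexed` list (standing in for Solr) are all assumptions made for illustration.

```python
import re

# Assumed log format: "<timestamp> <SEVERITY> <message>"
LOG_PATTERN = re.compile(r"^(?P<timestamp>\S+) (?P<severity>[A-Z]+) (?P<message>.*)$")


def read_line(record):
    """Decode the raw event body into a text line."""
    record["line"] = record.pop("body").decode("utf-8").strip()
    return record


def extract_fields(record):
    """Pull structured fields out of the line; drop records that do not match."""
    match = LOG_PATTERN.match(record["line"])
    if match is None:
        return None
    record.update(match.groupdict())
    return record


def to_document(record):
    """Map the record onto the fields of a search document."""
    return {
        "id": f'{record["host"]}-{record["timestamp"]}',
        "host": record["host"],
        "severity": record["severity"],
        "message": record["message"],
    }


def run_pipeline(record, commands):
    """Apply each command in order; a None result drops the record."""
    for command in commands:
        record = command(record)
        if record is None:
            return None
    return record


class IndexingSink:
    """Toy sink: transforms events and 'indexes' them in batches."""

    def __init__(self, commands, batch_size=2):
        self.commands = commands
        self.batch_size = batch_size
        self.pending = []
        self.indexed = []  # stands in for the Solr index

    def process(self, event):
        doc = run_pipeline(dict(event), self.commands)
        if doc is not None:
            self.pending.append(doc)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        self.indexed.extend(self.pending)
        self.pending.clear()
```

Keeping each step a small function mirrors the appeal of a configurable ETL chain: steps can be reordered or replaced without changing the sink.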

Near Real Time indexing of HBase
• HBase: planet-sized tabular data, immediate access & updates
• Search: fast & flexible information discovery
• HBase + Search = big data management
[Diagram: interactive load goes into HBase (on HDFS); updates trigger the indexer(s), which feed several Solr servers.]

Lily HBase Indexer
• Collaboration between NGData & Cloudera
• NGData are creators of the Lily data management platform
• Lily HBase Indexer
• Service which acts as a HBase replication listener
• Replication updates trigger indexing of updates (rows)
• Integrates Cloudera Morphlines library for ETL of rows
• AL2 licensed on github https://github.com/ngdata
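
The replication-listener pattern described above can be sketched as an object that receives row-level change events and mirrors them into a search index. This is a toy Python model, not the Lily HBase Indexer or HBase API: the event shape, `ToyIndex`, `RowIndexer`, and the `column_family:qualifier` field mapping are hypothetical.

```python
class ToyIndex:
    """In-memory stand-in for a Solr collection, keyed by document id."""

    def __init__(self):
        self.docs = {}

    def add(self, doc):
        # Adding a document with an existing id overwrites it, like a Solr update.
        self.docs[doc["id"]] = doc

    def delete_by_id(self, doc_id):
        self.docs.pop(doc_id, None)


def row_to_document(row_key, columns):
    """Map a row onto a search document (assumed 'family:qualifier' column names)."""
    doc = {"id": row_key}
    for column, value in columns.items():
        doc[column.split(":", 1)[1]] = value
    return doc


class RowIndexer:
    """Listens for replicated row changes and applies them to the index."""

    def __init__(self, index, mapper=row_to_document):
        self.index = index
        self.mapper = mapper

    def on_replication_event(self, event):
        if event["type"] == "delete":
            self.index.delete_by_id(event["row"])
        else:
            # Puts (inserts and updates) are re-indexed as whole documents.
            self.index.add(self.mapper(event["row"], event["columns"]))
```

Because the indexer only consumes the change stream, writers to the table are not slowed down by indexing work, which is the main attraction of hooking into replication.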

Scalable Batch Indexing
Solr and MapReduce
• Flexible, scalable batch indexing
• Start serving new indices with no downtime
• On-demand indexing, cost-efficient re-indexing
[Diagram: files in HDFS/HBase are read by indexers, which build index shards that are served by Solr servers.]

Scalable Batch Indexing
• Mappers: parse input into indexable documents
• Arbitrary reducing steps of indexing and merging
• End-Reducer (one per shard): index documents
• Output: Index shard 1, Index shard 2

MapReduce Indexer
• MapReduce Job with two parts
1) Scan HDFS for files (or HBase for records) to be indexed
2) Mapper/Reducer indexing step
• Mapper extracts content via Cloudera Morphlines
• Reducer uses Lucene to index documents directly to HDFS
• “golive”
• Cloudera created this to bridge the gap between NRT (low latency, expensive) and Batch (high latency, cheap at scale) indexing
• Results of the MR indexing operation are immediately merged into a live SolrCloud serving cluster
• No downtime for users
• No NRT expense
• Linear scale out to the size of your MR cluster
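
The map, partition, reduce, and go-live flow can be sketched in plain Python. This is a simplified model under stated assumptions, not the actual MapReduce indexer: documents are routed to shards by a CRC32 hash of their id, each "index" is a dict from term to document ids rather than Lucene segments, and `go_live` merges dicts where the real tool merges index files into running Solr servers. All function names are hypothetical.

```python
from collections import defaultdict
import zlib


def map_parse(path, text):
    """Mapper: parse one input file into an indexable document."""
    return {"id": path, "text": text}


def shard_for(doc_id, num_shards):
    """Partitioner: crc32 is stable across runs, unlike Python's str hash."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def reduce_index(docs):
    """Reducer: build one shard's inverted index (term -> sorted doc ids)."""
    index = defaultdict(set)
    for doc in docs:
        for term in doc["text"].lower().split():
            index[term].add(doc["id"])
    return {term: sorted(ids) for term, ids in index.items()}


def batch_index(files, num_shards):
    """Run the whole job: map every file, partition by shard, reduce per shard."""
    partitions = defaultdict(list)
    for path, text in files.items():
        doc = map_parse(path, text)
        partitions[shard_for(doc["id"], num_shards)].append(doc)
    return {shard: reduce_index(partitions[shard]) for shard in range(num_shards)}


def go_live(live_shards, new_shards):
    """Merge freshly built shard indexes into the ones already being served."""
    for shard, index in new_shards.items():
        live = live_shards.setdefault(shard, {})
        for term, ids in index.items():
            live[term] = sorted(set(live.get(term, [])) | set(ids))
    return live_shards


def search(live_shards, term):
    """Query every shard and combine the hits."""
    hits = set()
    for index in live_shards.values():
        hits.update(index.get(term.lower(), []))
    return sorted(hits)
```

Since each reducer builds its shard independently, adding reducers (and shards) spreads the indexing work, which is the intuition behind the linear scale-out bullet.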

Simple, Customizable Search Interface
Hue
• Simple UI
• Navigated, faceted drill down
• Customizable display
• Full text search, standard Solr API and query language

Conclusion and Q&A
• Try it now with Cloudera Live!
• Cloudera Search
• Free Download
• Extensive documentation
• Send your questions and feedback to Cloudera Search Forum
• Take the Search online training
• Cloudera Express (i.e. the free version)
• Simple management of Search
• Free Download
