Blur is a distributed search capability built on top of Hadoop and Lucene that is designed specifically for big data. It leverages the scalability, redundancy, and performance of Hadoop and Lucene. Blur stores data in tables containing rows and records with columns, and uses MapReduce to index data and shard servers to perform searches in a scalable and fault-tolerant manner. It overcomes challenges like reindexing large datasets and providing low-latency random access by leveraging features of its architecture. Future work includes more performance tuning, testing, documentation, and new query capabilities.
Blur - A Distributed Search Capability Built on Lucene and Hadoop
1. Blur - Lucene on Hadoop
Aaron McCurry
http://github.com/nearinfinity/blur
2. Aaron McCurry
• Programming Java for 10+ years
• Working with BigData for 3 years
• Using Lucene for 3 years
• Using Hadoop for 2 years
• B.S. in Computer Engineering from Virginia Tech
• Sr. Software Engineer at Near Infinity Corporation
Developing Blur for 1.5 years
3. Agenda
Definition of Blur and the key benefits
Blur architecture and its component parts
Query capabilities
Challenges that Blur had to overcome
4. What is Blur?
A distributed search capability built on top of Hadoop and Lucene
• Built specifically for Big Data
• Scalability, redundancy, and performance baked in from the start
• Leverages all the goodness built into the Hadoop and Lucene stack
Blur uses the Apache 2.0 license
5. Why should I use Blur?
Scalable: store, index and search massive amounts of data
Fast: performance similar to a standard Lucene implementation
Durable: stores data updates in a write-ahead log (WAL) in case of node failure
Failover: auto-detects node failure and re-assigns indexes to surviving nodes
Query flexibility: provides all the standard Lucene queries, plus join-like queries across column families
6. Blur Data Model
Blur stores information in Tables that contain Rows
Rows contain Records
Records exist in column families (Used for grouping information)
Records contain Columns
Columns contain a name / value pairing (Stored as Strings)
NOTE: Columns with the same name can exist in the same Record
7. Blur Data Model in JSON
{
rowid : "user_a@server.com",
records : [
{
recordid : "324182347",
family : "messages",
columns : [
{ name : "to", value : "user_b@server.com" },
{ name : "to", value : "user_c@server.com" },
{ name : "subject", value : "important!" },
{ name : "body", value : "This is a very important email...." }
] }, {
recordid : "234123412",
family : "contacts",
columns : [
{ name : "name",value:"Jon Doe" },
{ name : "email",value:"user_d@server.com" }
]
}
]
}
8. Blur Data Model
Table
RowID = user_a@server.com
  RecordID = 324182347, family = messages
    to: user_b@server.com
    to: user_c@server.com
    subject: important!
    body: This is a...
  RecordID = 234123412, family = contacts
    name: Jon Doe
    email: user_d@server.com
RowID = bob_smith@yahoo.com
...
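The row/record/column model above can be sketched as plain Java objects. This is an illustrative sketch only; the class names are hypothetical and do not mirror Blur's actual client API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the Blur data model: Tables contain Rows,
// Rows contain Records, Records live in a column family and hold
// name/value Columns (stored as Strings). Hypothetical classes, not Blur's API.
class Column {
    final String name, value;
    Column(String name, String value) { this.name = name; this.value = value; }
}

class Record {
    final String recordId, family;          // records exist in a column family
    final List<Column> columns = new ArrayList<>();
    Record(String recordId, String family) { this.recordId = recordId; this.family = family; }
}

class Row {
    final String rowId;
    final List<Record> records = new ArrayList<>();
    Row(String rowId) { this.rowId = rowId; }
}

public class DataModelSketch {
    public static Row exampleRow() {
        Row row = new Row("user_a@server.com");
        Record msg = new Record("324182347", "messages");
        msg.columns.add(new Column("to", "user_b@server.com"));
        msg.columns.add(new Column("to", "user_c@server.com"));  // same name twice is allowed
        msg.columns.add(new Column("subject", "important!"));
        Record contact = new Record("234123412", "contacts");
        contact.columns.add(new Column("name", "Jon Doe"));
        row.records.add(msg);
        row.records.add(contact);
        return row;
    }

    public static void main(String[] args) {
        Row row = exampleRow();
        System.out.println(row.rowId + " has " + row.records.size() + " records");
    }
}
```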
9. Blur Architecture
Lucene: performs the actual search duties
HDFS: stores the Lucene indexes
MapReduce: uses Hadoop's MapReduce to index data
Thrift: all inter-process communication
ZooKeeper: manages system state and stores metadata
10. Blur uses Two Types of Server Processes
Controller server: orchestrates communication between all of the
shard servers for queries
Uses: HDFS, Thrift and ZooKeeper
Shard server: responsible for performing searches for each shard and
returning results to the controller
Uses: same as the controller, plus Lucene
12. Why Lucene for Search?
Features: stable and performant, with robust features like NRT, GIS, new analyzers, etc.
Adoption: seems like everyone is using it, whether Lucene directly, Solr, or ElasticSearch
Community: very active open source project
API: easy to extend (analyzers, directories, etc.)
Future: Levenshtein Automaton (4.0), flexible indexing (4.0)
13. HDFS for Storage
Index data is stored in HDFS.
Data updates are written to a write-ahead log (WAL) before being
indexed into the appropriate Lucene index.
Sync is called on the WAL before the call returns, for durability
(this can be disabled per mutation call).
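The append-then-sync behavior can be sketched in plain Java. This is an illustrative sketch using a local file via FileChannel; Blur's actual WAL lives in HDFS and uses its own record format.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal write-ahead-log sketch: append the mutation, then force it to
// disk before acknowledging. Illustrative only, not Blur's actual WAL.
public class WalSketch {
    final FileChannel channel;

    WalSketch(Path path) throws IOException {
        channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Append one mutation; when sync is true, return only after the data
    // is durable (mirrors the optional sync on mutation calls).
    void append(String mutation, boolean sync) throws IOException {
        channel.write(ByteBuffer.wrap((mutation + "\n").getBytes(StandardCharsets.UTF_8)));
        if (sync) {
            channel.force(false);
        }
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("blur-wal", ".log");
        WalSketch wal = new WalSketch(log);
        wal.append("put rowid=user_a@server.com", true);
        wal.append("delete rowid=bob_smith@yahoo.com", true);
        System.out.println(Files.readAllLines(log));
        wal.channel.close();
    }
}
```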
14. Zookeeper for Meta Data and State
Shard server state is stored in ZooKeeper.
Table metadata, along with the Lucene writer locks, is stored
under the table node.
All online controllers are also registered in ZooKeeper.
15. Blur Query
Blur uses the standard Lucene query syntax
messages.to:(+jon +doe)
Blur also allows cross-column-family intersection queries:
+messages.to:(+joe +doe) +contacts.name:bill
This in effect gives you a join-like query, because messages
and contacts are stored in different Records:
find rows where a message was sent to Joe and Doe
and where the user has a contact named Bill.
Blur also supports any programmatic Lucene query (Java clients only)
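The join-like semantics of `+messages.to:(+joe +doe) +contacts.name:bill` can be illustrated in plain Java over the data model. This only models the row-level matching semantics (with a crude substring match standing in for analysis); Blur actually evaluates such queries against its Lucene indexes.

```java
import java.util.List;

// Row-level semantics of the cross-column-family intersection query:
// a row matches when SOME "messages" record was sent to both joe and doe,
// AND SOME "contacts" record names bill. Illustrative sketch only.
public class JoinQuerySketch {
    record Col(String name, String value) {}

    record Rec(String family, List<Col> cols) {
        // true if any column with this name contains the term (crude "analysis")
        boolean has(String colName, String term) {
            return cols.stream().anyMatch(c -> c.name().equals(colName)
                    && c.value().toLowerCase().contains(term));
        }
    }

    static boolean matches(List<Rec> row) {
        boolean sentToBoth = row.stream().anyMatch(r ->
                r.family().equals("messages") && r.has("to", "joe") && r.has("to", "doe"));
        boolean knowsBill = row.stream().anyMatch(r ->
                r.family().equals("contacts") && r.has("name", "bill"));
        return sentToBoth && knowsBill;   // both clauses must hold within one Row
    }

    public static void main(String[] args) {
        List<Rec> row = List.of(
                new Rec("messages", List.of(
                        new Col("to", "joe@server.com"), new Col("to", "doe@server.com"))),
                new Rec("contacts", List.of(new Col("name", "Bill Smith"))));
        System.out.println(matches(row));   // both conditions hold in this row
    }
}
```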
16. Challenges that Blur Solves
Reindexing of Datasets
Random Access Writes with Lucene
Random Access Latency with HDFS
JVM GC - LUCENE-2205 Lucene low memory patch
17. Reindexing of Datasets
Problem:
Reindex all of the data whenever needed, as fast as possible,
without affecting the performance of existing online datasets.
18. Reindexing of Datasets
Solution:
MapReduce to the rescue: Blur uses Hadoop to build the indexes and
deliver them to HDFS. This allows the very CPU- and I/O-intensive
computation to occur on the Hadoop cluster, where you probably have
the most computing resources.
The delivery of the indexes can be throttled to reduce the I/O impact
on the running systems. The indexes can also be delivered to different
HDFS instances for total separation of I/O.
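For a MapReduce job to build per-shard indexes, each Row has to be routed to exactly one shard index. Hashing the rowid is one plausible partitioning scheme; the deck does not specify the actual partitioner, so the sketch below is a hypothetical illustration of that routing step only.

```java
// Hypothetical partitioner sketch: route each Row to a shard index by
// hashing its rowid, so the same rowid always lands in the same shard.
// The actual partitioning used by Blur's indexing job is not shown here.
public class ShardPartitionerSketch {
    static int shardFor(String rowId, int numShards) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(rowId.hashCode(), numShards);
    }

    public static void main(String[] args) {
        String rowId = "user_a@server.com";
        int shard = shardFor(rowId, 16);
        System.out.println(rowId + " -> shard-" + shard);
    }
}
```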
19. Random Access Writes w/Lucene
Problem:
Writes in Lucene and in HDFS share a common trait: once a file is
closed for writing, it is never modified (it is immutable). However,
Lucene requires random access writes while a file is open for writing,
and HDFS cannot natively support this type of operation.
20. Random Access Writes w/Lucene
Solution:
A logically layered file whose writes are append-only. While a file is
open for writing, whenever a seek is called (one that actually moves to
a new position), a new logical block is created that stores the logical
position of the data, the real position of the data on disk, and the
length of the data.
When the file is opened for reading, the metadata about the logical
blocks is read into memory and used during reads to calculate the
real position of the requested data.
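The logical-block translation can be sketched as a small in-memory model. The names and structure here are hypothetical, and a StringBuilder stands in for the append-only physical file; Blur's real implementation writes the blocks and their metadata to HDFS.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the append-only layered file: every write appends
// physically, and a block table maps logical positions to real positions.
public class LayeredFileSketch {
    private record Block(long logicalPos, long realPos, int length) {}

    private final StringBuilder physical = new StringBuilder(); // stands in for the HDFS file
    private final List<Block> blocks = new ArrayList<>();

    // "Seek and write": record where the data logically lives, append physically.
    void writeAt(long logicalPos, String data) {
        blocks.add(new Block(logicalPos, physical.length(), data.length()));
        physical.append(data);   // the physical file only ever grows
    }

    // Reads resolve the logical position through the block table, preferring
    // the newest block that covers the position (later writes win).
    char readAt(long logicalPos) {
        for (int i = blocks.size() - 1; i >= 0; i--) {
            Block b = blocks.get(i);
            if (logicalPos >= b.logicalPos() && logicalPos < b.logicalPos() + b.length()) {
                return physical.charAt((int) (b.realPos() + (logicalPos - b.logicalPos())));
            }
        }
        throw new IllegalArgumentException("unwritten position " + logicalPos);
    }

    public static void main(String[] args) {
        LayeredFileSketch f = new LayeredFileSketch();
        f.writeAt(0, "hello world");
        f.writeAt(6, "lucene");        // logically overwrites "world"
        System.out.println(f.readAt(6));   // resolved through the newer block
    }
}
```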
21. Random Access Latency w/HDFS
Problem:
HDFS is not very good at random access reads. Great improvements
have been made, and more are coming in 0.23, but it still won't be
enough to support low-latency Lucene access. On a single machine with
a normal OS file system, Lucene relies on the file system cache or on
MMAP of index files for query performance.
22. Random Access Latency w/HDFS
Solution:
Add a block cache at the Lucene Directory level to store the hot blocks
from the files Lucene uses for searching. A concurrent LRU map stores
the locations of the blocks in pre-allocated slabs of memory. The slabs
are allocated at startup and are used, in essence, in place of an OS
file system cache.
A side benefit of this design is that writing new data to the HDFS
instance does not evict the hot blocks from memory, for example when
a new table of data is being written to HDFS by a MapReduce job.
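A minimal single-threaded sketch of the idea follows. Blur's actual cache is concurrent and sits behind the Lucene Directory read path; the class, block size, and keying here are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the Directory-level block cache: a fixed slab is allocated up
// front, and an access-ordered (LRU) map tracks which file block occupies
// which slab slot. Single-threaded and illustrative only.
public class BlockCacheSketch {
    private static final int BLOCK_SIZE = 8;          // tiny, for the example
    private final byte[] slab;                        // pre-allocated at startup
    private final int slots;
    private final LinkedHashMap<String, Integer> lru; // block key -> slot index

    BlockCacheSketch(int slots) {
        this.slots = slots;
        this.slab = new byte[slots * BLOCK_SIZE];
        this.lru = new LinkedHashMap<>(16, 0.75f, true); // access-ordered
    }

    // Cache one block; evicts the least recently used block when full,
    // reusing its slab slot so the slab itself never grows.
    void put(String file, long blockId, byte[] data) {
        String key = file + "#" + blockId;
        Integer slot = lru.remove(key);
        if (slot == null) {
            if (lru.size() == slots) {
                Map.Entry<String, Integer> eldest = lru.entrySet().iterator().next();
                slot = eldest.getValue();
                lru.remove(eldest.getKey());
            } else {
                slot = lru.size();
            }
        }
        System.arraycopy(data, 0, slab, slot * BLOCK_SIZE, Math.min(data.length, BLOCK_SIZE));
        lru.put(key, slot);
    }

    // Returns the cached block, or null on a miss (caller would read HDFS).
    byte[] get(String file, long blockId) {
        Integer slot = lru.get(file + "#" + blockId);  // get() refreshes LRU order
        if (slot == null) return null;
        byte[] out = new byte[BLOCK_SIZE];
        System.arraycopy(slab, slot * BLOCK_SIZE, out, 0, BLOCK_SIZE);
        return out;
    }

    public static void main(String[] args) {
        BlockCacheSketch cache = new BlockCacheSketch(2);
        cache.put("seg.tim", 0, "aaaa".getBytes());
        cache.put("seg.tim", 1, "bbbb".getBytes());
        cache.get("seg.tim", 0);                        // touch block 0
        cache.put("seg.tim", 2, "cccc".getBytes());     // evicts block 1
        System.out.println(cache.get("seg.tim", 1) == null);
    }
}
```

Because the slab is allocated once at startup, unrelated HDFS writes (such as a MapReduce job delivering a new table) cannot push the hot blocks out the way they would with an OS file system cache.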
23. Future
Blur is a new project, which means there is a lot of work to be done
to make sure it is ready for "web scale", but I believe it can be done.
Future Tasks:
More Performance Tuning
More Tests
More Documentation
Native GIS Queries
Incremental Updates from MapReduce
Index Splits
24. Questions?
Blur:
http://github.com/nearinfinity/blur
Blur 0.1.rc1 now available
Blog:
http://www.nearinfinity.com/blogs/aaron_mccurry/