Blur is a distributed search capability built on top of Hadoop and Lucene that is designed specifically for big data. It leverages the scalability, redundancy, and performance of Hadoop and Lucene. Blur stores data in tables containing rows and records with columns, and uses MapReduce to index data and shard servers to perform searches in a scalable and fault-tolerant manner. It overcomes challenges like reindexing large datasets and providing low-latency random access by leveraging features of its architecture. Future work includes more performance tuning, testing, documentation, and new query capabilities.
Blur - A Distributed Search Capability Built on Lucene and Hadoop
1. Blur - Lucene on Hadoop
Aaron McCurry
http://github.com/nearinfinity/blur
2. Aaron McCurry
• Programming Java for 10+ years
• Working with BigData for 3 years
• Using Lucene for 3 years
• Using Hadoop for 2 years
• B.S. in Computer Engineering from Virginia Tech
• Sr. Software Engineer at Near Infinity Corporation
Developing Blur for 1.5 years
3. Agenda
Definition of Blur and the key benefits
Blur architecture and its component parts
Query capabilities
Challenges that Blur had to overcome
4. What is Blur?
A distributed search capability built on top of Hadoop and Lucene
• Built specifically for Big Data
• Scalability, redundancy, and performance baked in from the start
• Leverages all the goodness built into the Hadoop and Lucene stack
Blur uses the Apache 2.0 license
5. Why should I use Blur?
Scalable: store, index and search massive amounts of data
Fast: performance similar to a standard Lucene implementation
Durable: stores data updates in a write-ahead log (WAL) in case of node failure
Failover: auto-detects node failure and re-assigns indexes to surviving nodes
Query flexibility: provides all the standard Lucene queries, plus join-like queries across column families
6. Blur Data Model
Blur stores information in Tables that contain Rows
Rows contain Records
Records exist in column families (Used for grouping information)
Records contain Columns
Columns contain a name / value pairing (Stored as Strings)
NOTE: Columns with the same name can exist in the same Record
7. Blur Data Model in JSON
{
rowid : "user_a@server.com",
records : [
{
recordid : "324182347",
family : "messages",
columns : [
{ name : "to", value : "user_b@server.com" },
{ name : "to", value : "user_c@server.com" },
{ name : "subject", value : "important!" },
{ name : "body", value : "This is a very important email...." }
] }, {
recordid : "234123412",
family : "contacts",
columns : [
{ name : "name",value:"Jon Doe" },
{ name : "email",value:"user_d@server.com" }
]
}
]
}
8. Blur Data Model
Table
RowID = user_a@server.com
  RecordID = 324182347, family = messages
    to: user_b@server.com
    to: user_c@server.com
    subject: important!
    body: This is a...
  RecordID = 234123412, family = contacts
    name: Jon Doe
    email: user_d@server.com
RowID = bob_smith@yahoo.com
...
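The row/record/column model above can be sketched as plain Java objects. This is an illustrative sketch only; the class names are hypothetical and do not mirror Blur's actual client API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the Blur data model: Tables contain Rows,
// Rows contain Records, Records live in a column family and hold
// name/value Columns (stored as Strings). Hypothetical classes, not Blur's API.
class Column {
    final String name, value;
    Column(String name, String value) { this.name = name; this.value = value; }
}

class Record {
    final String recordId, family;          // records exist in a column family
    final List<Column> columns = new ArrayList<>();
    Record(String recordId, String family) { this.recordId = recordId; this.family = family; }
}

class Row {
    final String rowId;
    final List<Record> records = new ArrayList<>();
    Row(String rowId) { this.rowId = rowId; }
}

public class DataModelSketch {
    public static Row exampleRow() {
        Row row = new Row("user_a@server.com");
        Record msg = new Record("324182347", "messages");
        msg.columns.add(new Column("to", "user_b@server.com"));
        msg.columns.add(new Column("to", "user_c@server.com"));  // same name twice is allowed
        msg.columns.add(new Column("subject", "important!"));
        Record contact = new Record("234123412", "contacts");
        contact.columns.add(new Column("name", "Jon Doe"));
        row.records.add(msg);
        row.records.add(contact);
        return row;
    }

    public static void main(String[] args) {
        Row row = exampleRow();
        System.out.println(row.rowId + " has " + row.records.size() + " records");
    }
}
```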
9. Blur Architecture
Lucene: performs the actual search duties
HDFS: stores the Lucene indexes
MapReduce: uses Hadoop's MapReduce to index data
Thrift: all inter-process communication
ZooKeeper: manages system state and stores metadata
10. Blur uses Two Types of Server Processes
Controller server: orchestrates communication between all of the
shard servers for queries
Uses: HDFS, Thrift and ZooKeeper
Shard server: responsible for performing searches for each shard and
returning results to the controller
Uses: same as the controller, plus Lucene
12. Why Lucene for Search?
Features: stable and performant, with robust features like NRT, GIS, new analyzers, etc.
Adoption: seems like everyone is using it, whether Lucene directly, Solr, or ElasticSearch
Community: very active open source project
API: easy to extend (analyzers, directories, etc.)
Future: Levenshtein Automaton (4.0), flexible indexing (4.0)
13. HDFS for Storage
Index data is stored in HDFS.
Data updates are written to a write-ahead log (WAL) before being
indexed into the appropriate Lucene index.
Sync is called on the WAL before the call returns, for durability
(this can be disabled per mutation call).
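The append-then-sync behavior can be sketched in plain Java. This is an illustrative sketch using a local file via FileChannel; Blur's actual WAL lives in HDFS and uses its own record format.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal write-ahead-log sketch: append the mutation, then force it to
// disk before acknowledging. Illustrative only, not Blur's actual WAL.
public class WalSketch {
    final FileChannel channel;

    WalSketch(Path path) throws IOException {
        channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Append one mutation; when sync is true, return only after the data
    // is durable (mirrors the optional sync on mutation calls).
    void append(String mutation, boolean sync) throws IOException {
        channel.write(ByteBuffer.wrap((mutation + "\n").getBytes(StandardCharsets.UTF_8)));
        if (sync) {
            channel.force(false);
        }
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("blur-wal", ".log");
        WalSketch wal = new WalSketch(log);
        wal.append("put rowid=user_a@server.com", true);
        wal.append("delete rowid=bob_smith@yahoo.com", true);
        System.out.println(Files.readAllLines(log));
        wal.channel.close();
    }
}
```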
14. Zookeeper for Meta Data and State
Shard server state is stored in ZooKeeper.
Table metadata, along with the Lucene writer locks, is stored
under the table node.
All online controllers are also registered in ZooKeeper.
15. Blur Query
Blur uses the standard Lucene query syntax
messages.to:(+jon +doe)
Blur also allows cross-column-family intersection queries:
+messages.to:(+joe +doe) +contacts.name:bill
This in effect gives you a join-like query, because messages
and contacts are stored in different Records:
find rows where a message was sent to Joe and Doe
and where the user has a contact named Bill.
Blur also supports any programmatic Lucene query (Java clients only)
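The join-like semantics of `+messages.to:(+joe +doe) +contacts.name:bill` can be illustrated in plain Java over the data model. This only models the row-level matching semantics (with a crude substring match standing in for analysis); Blur actually evaluates such queries against its Lucene indexes.

```java
import java.util.List;

// Row-level semantics of the cross-column-family intersection query:
// a row matches when SOME "messages" record was sent to both joe and doe,
// AND SOME "contacts" record names bill. Illustrative sketch only.
public class JoinQuerySketch {
    record Col(String name, String value) {}

    record Rec(String family, List<Col> cols) {
        // true if any column with this name contains the term (crude "analysis")
        boolean has(String colName, String term) {
            return cols.stream().anyMatch(c -> c.name().equals(colName)
                    && c.value().toLowerCase().contains(term));
        }
    }

    static boolean matches(List<Rec> row) {
        boolean sentToBoth = row.stream().anyMatch(r ->
                r.family().equals("messages") && r.has("to", "joe") && r.has("to", "doe"));
        boolean knowsBill = row.stream().anyMatch(r ->
                r.family().equals("contacts") && r.has("name", "bill"));
        return sentToBoth && knowsBill;   // both clauses must hold within one Row
    }

    public static void main(String[] args) {
        List<Rec> row = List.of(
                new Rec("messages", List.of(
                        new Col("to", "joe@server.com"), new Col("to", "doe@server.com"))),
                new Rec("contacts", List.of(new Col("name", "Bill Smith"))));
        System.out.println(matches(row));   // both conditions hold in this row
    }
}
```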
16. Challenges that Blur Solves
Reindexing of Datasets
Random Access Writes with Lucene
Random Access Latency with HDFS
JVM GC - LUCENE-2205 Lucene low memory patch
17. Reindexing of Datasets
Problem:
Reindex all of the data whenever needed, as fast as possible,
without affecting the performance of existing online datasets.
18. Reindexing of Datasets
Solution:
MapReduce to the rescue: Blur uses Hadoop to build the indexes and
deliver them to HDFS. This allows the very CPU- and I/O-intensive
computation to occur on the Hadoop cluster, where you probably have
the most computing resources.
The delivery of the indexes can be throttled to reduce the I/O impact
on the running systems. The indexes can also be delivered to different
HDFS instances for total separation of I/O.
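For a MapReduce job to build per-shard indexes, each Row has to be routed to exactly one shard index. Hashing the rowid is one plausible partitioning scheme; the deck does not specify the actual partitioner, so the sketch below is a hypothetical illustration of that routing step only.

```java
// Hypothetical partitioner sketch: route each Row to a shard index by
// hashing its rowid, so the same rowid always lands in the same shard.
// The actual partitioning used by Blur's indexing job is not shown here.
public class ShardPartitionerSketch {
    static int shardFor(String rowId, int numShards) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(rowId.hashCode(), numShards);
    }

    public static void main(String[] args) {
        String rowId = "user_a@server.com";
        int shard = shardFor(rowId, 16);
        System.out.println(rowId + " -> shard-" + shard);
    }
}
```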
19. Random Access Writes w/Lucene
Problem:
Writes in Lucene and in HDFS share a common trait: once a file is
closed for writing, it is never modified (it is immutable). However,
Lucene requires random access writes while a file is open for writing,
and HDFS cannot natively support this type of operation.
20. Random Access Writes w/Lucene
Solution:
A logically layered file whose writes are append-only. While a file is
open for writing, whenever a seek is called (one that actually moves to
a new position), a new logical block is created that stores the logical
position of the data, the real position of the data on disk, and the
length of the data.
When the file is opened for reading, the metadata about the logical
blocks is read into memory and used during reads to calculate the
real position of the requested data.
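The logical-block translation can be sketched as a small in-memory model. The names and structure here are hypothetical, and a StringBuilder stands in for the append-only physical file; Blur's real implementation writes the blocks and their metadata to HDFS.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the append-only layered file: every write appends
// physically, and a block table maps logical positions to real positions.
public class LayeredFileSketch {
    private record Block(long logicalPos, long realPos, int length) {}

    private final StringBuilder physical = new StringBuilder(); // stands in for the HDFS file
    private final List<Block> blocks = new ArrayList<>();

    // "Seek and write": record where the data logically lives, append physically.
    void writeAt(long logicalPos, String data) {
        blocks.add(new Block(logicalPos, physical.length(), data.length()));
        physical.append(data);   // the physical file only ever grows
    }

    // Reads resolve the logical position through the block table, preferring
    // the newest block that covers the position (later writes win).
    char readAt(long logicalPos) {
        for (int i = blocks.size() - 1; i >= 0; i--) {
            Block b = blocks.get(i);
            if (logicalPos >= b.logicalPos() && logicalPos < b.logicalPos() + b.length()) {
                return physical.charAt((int) (b.realPos() + (logicalPos - b.logicalPos())));
            }
        }
        throw new IllegalArgumentException("unwritten position " + logicalPos);
    }

    public static void main(String[] args) {
        LayeredFileSketch f = new LayeredFileSketch();
        f.writeAt(0, "hello world");
        f.writeAt(6, "lucene");        // logically overwrites "world"
        System.out.println(f.readAt(6));   // resolved through the newer block
    }
}
```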
21. Random Access Latency w/HDFS
Problem:
HDFS is not very good at random access reads. Great improvements
have been made, and more are coming in 0.23, but it still won't be
enough to support low-latency Lucene access. On a single machine with
a normal OS file system, Lucene relies on the file system cache or on
MMAP of index files for query performance.
22. Random Access Latency w/HDFS
Solution:
Add a block cache at the Lucene Directory level to store the hot blocks
from the files Lucene uses for searching. A concurrent LRU map stores
the locations of the blocks in pre-allocated slabs of memory. The slabs
are allocated at startup and are used, in essence, in place of an OS
file system cache.
A side benefit of this design is that writing new data to the HDFS
instance does not evict the hot blocks from memory, for example when
a new table of data is being written to HDFS by a MapReduce job.
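A minimal single-threaded sketch of the idea follows. Blur's actual cache is concurrent and sits behind the Lucene Directory read path; the class, block size, and keying here are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the Directory-level block cache: a fixed slab is allocated up
// front, and an access-ordered (LRU) map tracks which file block occupies
// which slab slot. Single-threaded and illustrative only.
public class BlockCacheSketch {
    private static final int BLOCK_SIZE = 8;          // tiny, for the example
    private final byte[] slab;                        // pre-allocated at startup
    private final int slots;
    private final LinkedHashMap<String, Integer> lru; // block key -> slot index

    BlockCacheSketch(int slots) {
        this.slots = slots;
        this.slab = new byte[slots * BLOCK_SIZE];
        this.lru = new LinkedHashMap<>(16, 0.75f, true); // access-ordered
    }

    // Cache one block; evicts the least recently used block when full,
    // reusing its slab slot so the slab itself never grows.
    void put(String file, long blockId, byte[] data) {
        String key = file + "#" + blockId;
        Integer slot = lru.remove(key);
        if (slot == null) {
            if (lru.size() == slots) {
                Map.Entry<String, Integer> eldest = lru.entrySet().iterator().next();
                slot = eldest.getValue();
                lru.remove(eldest.getKey());
            } else {
                slot = lru.size();
            }
        }
        System.arraycopy(data, 0, slab, slot * BLOCK_SIZE, Math.min(data.length, BLOCK_SIZE));
        lru.put(key, slot);
    }

    // Returns the cached block, or null on a miss (caller would read HDFS).
    byte[] get(String file, long blockId) {
        Integer slot = lru.get(file + "#" + blockId);  // get() refreshes LRU order
        if (slot == null) return null;
        byte[] out = new byte[BLOCK_SIZE];
        System.arraycopy(slab, slot * BLOCK_SIZE, out, 0, BLOCK_SIZE);
        return out;
    }

    public static void main(String[] args) {
        BlockCacheSketch cache = new BlockCacheSketch(2);
        cache.put("seg.tim", 0, "aaaa".getBytes());
        cache.put("seg.tim", 1, "bbbb".getBytes());
        cache.get("seg.tim", 0);                        // touch block 0
        cache.put("seg.tim", 2, "cccc".getBytes());     // evicts block 1
        System.out.println(cache.get("seg.tim", 1) == null);
    }
}
```

Because the slab is allocated once at startup, unrelated HDFS writes (such as a MapReduce job delivering a new table) cannot push the hot blocks out the way they would with an OS file system cache.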
23. Future
Blur is a new project, which means there is a lot of work to be done
to make sure it is ready for "web scale", but I believe it can be done.
Future Tasks:
More Performance Tuning
More Tests
More Documentation
Native GIS Queries
Incremental Updates from MapReduce
Index Splits
24. Questions?
Blur:
http://github.com/nearinfinity/blur
Blur 0.1.rc1 now available
Blog:
http://www.nearinfinity.com/blogs/aaron_mccurry/