3. Problem
Amadeus is the world's leading technology provider
to the travel industry, providing marketing,
distribution and IT services worldwide.
The Amadeus computer reservation system (CRS)
processed 850 million billable travel transactions in
2010.
The current logging framework produces 100'000 to
1'000'000 messages per second.
Monday, 16 July 2012
4. Problem - Log Messages
Messages with 1 KB average size
A message can be anything: XML, EDIFACT, hex
dump, ...
A few fixed attributes per message given:
Timestamp, source, various ids.
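A rough sense of the raw data volume follows from the two figures on the slides (100'000 to 1'000'000 messages per second, 1 KB average size). The sketch below is only arithmetic; treating 1 KB as 1024 bytes and assuming the rate is sustained around the clock are assumptions, not statements from the slides.

```python
# Back-of-the-envelope data volume for the figures quoted on the slides.
AVG_MESSAGE_BYTES = 1024                  # "1 KB average size", taken as 1024 bytes (assumption)
RATES_PER_SECOND = (100_000, 1_000_000)   # low and high message rates from the slides
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_second(messages_per_second, message_bytes=AVG_MESSAGE_BYTES):
    """Raw payload volume per second, ignoring compression and storage overhead."""
    return messages_per_second * message_bytes


def tib_per_day(messages_per_second, message_bytes=AVG_MESSAGE_BYTES):
    """Raw payload volume per day in TiB, assuming the rate is sustained all day."""
    return bytes_per_second(messages_per_second, message_bytes) * SECONDS_PER_DAY / 2**40


if __name__ == "__main__":
    for rate in RATES_PER_SECOND:
        mib_per_second = bytes_per_second(rate) / 2**20
        print(f"{rate:>9} msg/s: {mib_per_second:7.1f} MiB/s, {tib_per_day(rate):6.2f} TiB/day")
```

Under these assumptions the uncompressed volume is on the order of 100 MiB/s to 1 GiB/s.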
5. Problem - Current Solution
Write log messages in plain text files.
Split, compress and copy to SAN.
Queries? Search? Statistics?
6. Solution Overview
Use Apache HBase for storage and instant random
access
Apache Hadoop MapReduce for complex queries.
Apache Solr as full text search engine for queries on
the log messages.
7. Apache HBase
Open source, non-relational, distributed database.
Modeled after Google's BigTable
Runs on top of the Hadoop Distributed File System
(HDFS)
8. HBase - Terms
Region
Contiguous ranges of rows stored together
Dynamically split / merged and distributed
RegionServer (slave)
Serves regions, i.e. handles reads and writes for their data
HMaster (master)
Responsible for coordination
Assigns regions to RegionServers, detects failures
Admin functions
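Because a region is a contiguous range of rows, finding the region for a row key amounts to a search over sorted start keys. The sketch below is a toy model of that idea, not HBase code; the class `RegionMap`, its methods and the server names are hypothetical.

```python
import bisect


class RegionMap:
    """Toy model: sorted region start keys mapped to hypothetical server names."""

    def __init__(self, regions):
        # regions: list of (start_key, server). The first start key must be ""
        # so that every possible row key falls into some region.
        self._regions = sorted(regions)
        self._starts = [start for start, _ in self._regions]
        if not self._starts or self._starts[0] != "":
            raise ValueError("first region must start at the empty key")

    def locate(self, row_key):
        """Return (start_key, server) of the region whose range contains row_key."""
        index = bisect.bisect_right(self._starts, row_key) - 1
        return self._regions[index]

    def split(self, split_key, new_server):
        """Split the region containing split_key; the upper half goes to new_server."""
        if split_key in self._starts:
            raise ValueError("a region already starts at this key")
        bisect.insort(self._regions, (split_key, new_server))
        self._starts = [start for start, _ in self._regions]


if __name__ == "__main__":
    region_map = RegionMap([("", "rs1"), ("m", "rs2")])
    print(region_map.locate("apple"))
    region_map.split("t", "rs3")
    print(region_map.locate("zebra"))
```

Splitting only adds a new start key, which is why regions can be split and redistributed without touching rows in other regions.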
10. HBase - Data Access
Java API
REST
Apache Avro, Apache Thrift
Hadoop MapReduce
11. HBase - Secondary Indexes
No native support for secondary indexes
Different choices:
Client managed: Write the value to the data table and the index
entry to an index table
Coprocessors that automatically create the secondary index
Periodic update: Use a MapReduce job to build the index
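The client-managed option can be sketched with two in-memory dictionaries standing in for the data table and the index table. This is a minimal illustration, not the HBase client API; the class `ClientManagedIndex` and the index key layout `<value>|<row key>` are assumptions made for the example.

```python
class ClientManagedIndex:
    """Toy client that writes to a data table and one index table."""

    def __init__(self, indexed_attribute):
        self.indexed_attribute = indexed_attribute
        self.data_table = {}    # row key -> dict of attributes
        self.index_table = {}   # "<attribute value>|<row key>" -> row key

    def put(self, row_key, attributes):
        # Write 1: the log message itself.
        self.data_table[row_key] = dict(attributes)
        # Write 2: the index entry. The two writes are not atomic, and an
        # overwrite with a different value would leave a stale index entry;
        # both are weaknesses this toy version does not address.
        value = attributes[self.indexed_attribute]
        self.index_table[f"{value}|{row_key}"] = row_key

    def find_by_attribute(self, value):
        """Prefix scan over the index table, then fetch rows from the data table."""
        prefix = f"{value}|"
        row_keys = sorted(
            row_key
            for index_key, row_key in self.index_table.items()
            if index_key.startswith(prefix)
        )
        return [self.data_table[row_key] for row_key in row_keys]


if __name__ == "__main__":
    client = ClientManagedIndex("source")
    client.put("r1", {"source": "booking", "body": "<xml/>"})
    client.put("r2", {"source": "pricing", "body": "0A FF"})
    print(client.find_by_attribute("booking"))
```

Every message costs one extra write per index, which is the price of this approach.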
12. HBase - Coprocessors
Run arbitrary code on any node:
Observer: RegionObserver, MasterObserver, WALObserver
provide hooks for code execution
(prePut, postPut, preGet, postGet, ...)
Endpoint: Installed on nodes, executed on client request
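The observer idea can be illustrated with a toy region that calls hooks before and after each put. This is a simplified sketch and not the HBase coprocessor API: the classes `RegionObserver`, `ToyRegion` and `ForwardToIndexer` are hypothetical stand-ins, and letting `pre_put` return a modified value is a simplification chosen for the example.

```python
class RegionObserver:
    """Base class with no-op hooks, loosely named after the hooks on the slide."""

    def pre_put(self, row_key, value):
        return value  # simplification: a hook may return a modified value

    def post_put(self, row_key, value):
        pass


class ToyRegion:
    """Stores rows in a dict and runs the registered observers around each put."""

    def __init__(self, observers=()):
        self.rows = {}
        self.observers = list(observers)

    def put(self, row_key, value):
        for observer in self.observers:
            value = observer.pre_put(row_key, value)
        self.rows[row_key] = value
        for observer in self.observers:
            observer.post_put(row_key, value)


class ForwardToIndexer(RegionObserver):
    """Hypothetical observer that hands every stored row to an external indexer."""

    def __init__(self):
        self.forwarded = []  # stands in for calls to a search engine

    def post_put(self, row_key, value):
        self.forwarded.append((row_key, value))


if __name__ == "__main__":
    indexer = ForwardToIndexer()
    region = ToyRegion([indexer])
    region.put("r1", "log message text")
    print(indexer.forwarded)
```

The client only calls `put`; the hook runs on the storing side, so the client needs no knowledge of the indexer.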
13. Apache Solr
Apache Lucene + many features like
Distributed index
Distributed search
...
Apache Lucene is a high-performance, full-featured
text search engine library
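Full-text search engines of this kind are built around an inverted index, which maps each term to the documents that contain it. The sketch below shows only that core idea in plain Python; it does not use Lucene or Solr, and the class `InvertedIndex` and its very simple tokenizer are assumptions for illustration.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: term -> set of document ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def tokenize(text):
        # Lowercase and keep runs of letters and digits; markup characters
        # such as < > = ' act as separators.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text):
        for term in self.tokenize(text):
            self._postings[term].add(doc_id)

    def search(self, query):
        """Return the sorted ids of documents containing every query term."""
        terms = self.tokenize(query)
        if not terms:
            return []
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return sorted(result)


if __name__ == "__main__":
    index = InvertedIndex()
    index.add("m1", "<booking id='42'>ZRH-LHR</booking>")
    index.add("m2", "timeout while booking seat")
    print(index.search("booking"))
    print(index.search("booking timeout"))
```

How a message is tokenized decides what can be found later, which matters for log messages that are XML or hex dumps.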
14. Solution - Details
Client -> HBase: Insert log messages, create secondary
indexes for predefined attributes.
HBase -> Solr: Use coprocessor functionality to index
log messages in Solr after insert.
16. Solution - HBase & MapReduce
Very good integration of MapReduce into HBase
Easy to use HBase as data source, data sink or both
Provides helper classes
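The source-and-sink pattern can be sketched without Hadoop: a job reads rows from one table, maps and reduces them, and writes the result to another table. The code below is a single-process toy, not the HBase MapReduce helper classes; the function names and the `source` attribute are assumptions for the example.

```python
from collections import defaultdict


def map_row(row_key, row):
    """Map phase: emit (source, 1) for every log message row."""
    yield row["source"], 1


def reduce_counts(source, counts):
    """Reduce phase: sum the counts emitted for one source."""
    return source, sum(counts)


def run_job(source_table, sink_table):
    """Read every row of source_table, count messages per source, write to sink_table."""
    grouped = defaultdict(list)
    for row_key, row in source_table.items():
        for key, value in map_row(row_key, row):
            grouped[key].append(value)   # stands in for the shuffle step
    for key in sorted(grouped):
        out_key, total = reduce_counts(key, grouped[key])
        sink_table[out_key] = {"count": total}
    return sink_table


if __name__ == "__main__":
    log_table = {
        "r1": {"source": "booking", "body": "..."},
        "r2": {"source": "pricing", "body": "..."},
        "r3": {"source": "booking", "body": "..."},
    }
    print(run_job(log_table, {}))
```

In a real job the map and reduce calls would run in parallel across the cluster, close to the regions that hold the rows.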
17. Solution - Problems
Can Solr keep up with HBase?
Is Solr full text search practical for log messages?
(XML, other formats, ...)
18. Results
Not many, yet.
Generic experiments with random data
Experiments with real log data just started
19. Results - Write Random Data - HBase Only
Insert random data, 1KB records.
Cluster configuration:
5 Nodes:
RAM: 24 GiB
CPU: Intel Xeon L5520, 2.26 GHz
HD: 2x 15k RPM SAS, 73 GB (RAID 1)
Node 1: Master (NameNode, HMaster, ZooKeeper)
Nodes 2-5: Slaves (DataNode, RegionServer)
Client on separate node
Experiment executed with and without secondary
indexes (5 additional indexes).
20. Results - Write Random Data - HBase Only
                          avg. inserts/sec
No secondary indexes      ~30'000
Secondary indexes         ~6'000 (not counting index inserts)
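One way to read the two figures: with 5 additional indexes, each message may cause 6 writes in total. The sketch below computes the total write rate under the assumption that every index entry is exactly one extra write; the slides do not state how the indexes were written, so this is an interpretation only.

```python
NO_INDEX_RATE = 30_000   # avg. inserts/sec without secondary indexes (from the slide)
INDEXED_RATE = 6_000     # avg. inserts/sec with secondary indexes (from the slide)
INDEX_COUNT = 5          # additional indexes in the experiment (from the slide)


def total_writes_per_second(message_inserts_per_second, index_count):
    """Total writes, assuming one data write plus one write per index (assumption)."""
    return message_inserts_per_second * (1 + index_count)


if __name__ == "__main__":
    print(total_writes_per_second(NO_INDEX_RATE, 0))
    print(total_writes_per_second(INDEXED_RATE, INDEX_COUNT))
```

Under that assumption the indexed run performs about 36'000 writes per second, in the same range as the ~30'000 of the run without indexes.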
21. Results - Write Real Data - HBase & Solr
No real numbers
First tests: A single Solr instance indexes ~1'000 log
messages per second.
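To put the ~1'000 messages per second in relation to the HBase insert rates measured with random data, the sketch below estimates how many Solr instances would be needed to match a given insert rate. It assumes throughput scales linearly with the number of instances, which is optimistic and not something the slides claim; the function name is hypothetical.

```python
SOLR_RATE_PER_INSTANCE = 1_000   # log messages indexed per second (first tests)


def instances_needed(target_rate, per_instance_rate=SOLR_RATE_PER_INSTANCE):
    """Smallest instance count whose combined rate reaches target_rate,
    assuming perfectly linear scaling (an optimistic assumption)."""
    if per_instance_rate <= 0:
        raise ValueError("per_instance_rate must be positive")
    return -(-target_rate // per_instance_rate)   # ceiling division on integers


if __name__ == "__main__":
    for hbase_rate in (6_000, 30_000):
        print(hbase_rate, instances_needed(hbase_rate))
```

Even under this optimistic assumption, a single instance falls well short of the HBase rates, so the question of whether Solr can keep up remains open.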
27. HBase - Monitoring
Ganglia is a scalable distributed monitoring system
for high-performance computing systems such as
clusters and grids.
HBase provides metrics for Ganglia.