Large Scale Log Analysis with HBase and Solr at Amadeus (Martin Alig, ETH Zurich)

Large Scale Log Analysis with HBase and Solr at Amadeus
Martin Alig
aligma@student.ethz.ch

Overview
 Problem
 Solution - Overview
 HBase
 Solr
 Solution - Details
 Results

Monday, 16 July 2012

Problem

 Amadeus is the world's leading technology provider
to the travel industry, providing marketing,
distribution and IT services worldwide.
 The Amadeus computer reservation system (CRS)
processed 850 million billable travel transactions in
2010.
 Current logging framework produces 100'000 -
1'000'000 messages per second


Problem - Log Messages

 Messages with 1 KB average size
 A message can be anything: XML, EDIFACT, hex
dump, ...
 A few fixed attributes are given per message:
timestamp, source, various IDs.
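As a rough sizing sketch, the message rate and the average message size quoted above can be multiplied out to a raw data volume. The calculation assumes 1 KB = 1000 bytes and a sustained rate over a full day; neither is stated on the slides, and the function names are illustrative.

```python
# Back-of-the-envelope log volume from the figures on the slides.
AVG_MESSAGE_BYTES = 1_000                 # assumption: 1 KB taken as 1000 bytes
RATES_PER_SECOND = (100_000, 1_000_000)   # lower and upper bound from the slides
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_second(rate: int) -> int:
    """Raw log volume per second for a given message rate."""
    return rate * AVG_MESSAGE_BYTES


def terabytes_per_day(rate: int) -> float:
    """Raw log volume per day in TB (10**12 bytes), assuming a sustained rate."""
    return bytes_per_second(rate) * SECONDS_PER_DAY / 1e12


for rate in RATES_PER_SECOND:
    print(f"{rate:>9} msg/s -> {bytes_per_second(rate) / 1e6:.0f} MB/s, "
          f"{terabytes_per_day(rate):.2f} TB/day")
```

Under these assumptions the range is roughly 100 MB/s to 1 GB/s of uncompressed log data.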


Problem - Current Solution

 Write log messages in plain text files.
 Split, compress and copy to SAN.

Queries? Search? Statistics?


Solution Overview

 Use Apache HBase for storage and instant random
access
 Apache Hadoop MapReduce for complex queries.
 Apache Solr as full text search engine for queries on
the log messages.


Apache HBase

 Open source, non-relational, distributed database.
 Modeled after Google's BigTable
 Runs on top of the Hadoop Distributed File System
(HDFS)


HBase - Terms

 Region
 Contiguous ranges of rows stored together
 Dynamically split / merged and distributed
 RegionServer (slave)
 Serves regions, i.e. handles reads and writes for their data
 HMaster (master)
 Responsible for coordination
 Assigns regions to Region Servers, detects failures
 Admin functions
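The region concept above can be illustrated with a toy model: regions are contiguous, sorted row-key ranges, so finding the region for a row is a binary search over the region start keys. This is a minimal sketch, not HBase code; `RegionMap`, its methods, and the server names are hypothetical.

```python
import bisect


class RegionMap:
    """Toy model: region i covers row keys in [start_keys[i], start_keys[i + 1])."""

    def __init__(self, start_keys, servers):
        # Start keys must be sorted and the first region starts at the empty key.
        assert start_keys and start_keys[0] == ""
        assert list(start_keys) == sorted(start_keys)
        assert len(servers) == len(start_keys)
        self.start_keys = list(start_keys)
        self.servers = list(servers)

    def locate(self, row_key):
        """Return (region start key, server) for the region holding row_key."""
        idx = bisect.bisect_right(self.start_keys, row_key) - 1
        return self.start_keys[idx], self.servers[idx]

    def split(self, split_key, new_server):
        """Split the region containing split_key; the upper half goes to new_server."""
        idx = bisect.bisect_left(self.start_keys, split_key)
        if idx < len(self.start_keys) and self.start_keys[idx] == split_key:
            raise ValueError("split key is already a region boundary")
        self.start_keys.insert(idx, split_key)
        self.servers.insert(idx, new_server)


regions = RegionMap(["", "g", "p"], ["rs1", "rs2", "rs3"])
print(regions.locate("hello"))   # falls into the region starting at "g"
regions.split("k", "rs4")
print(regions.locate("m"))       # now served by the new region starting at "k"
```

In the real system the assignment of regions to RegionServers is the HMaster's job; the sketch only shows why contiguous ranges make lookups and splits cheap.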


HBase - Architecture

[Diagram: Client; ZooKeeper (3 instances); HMaster (2 instances); RegionServer (3 instances); HDFS underneath]


HBase - Data Access

 Java API
 REST
 Apache Avro, Apache Thrift
 Hadoop MapReduce


HBase - Secondary Indexes

 No native support for secondary indexes
 Different choices:
 Client-managed: write the value to the data table and the
index entry to an index table
 Coprocessors that automatically create the secondary index
 Periodic update: use a MapReduce job to add the index
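The client-managed option can be sketched with two in-memory dictionaries standing in for the data table and the index table. This is an illustration under assumptions, not the implementation used in the project: the index key layout `attribute=value|row key` and all names are hypothetical, and the two writes are not atomic, just as two separate puts from a client would not be.

```python
class ClientManagedIndex:
    """Toy client that maintains a secondary index alongside the data table."""

    def __init__(self):
        self.data_table = {}    # row key -> {attribute: value}
        self.index_table = {}   # "attribute=value|row key" -> row key

    def put(self, row_key, attributes):
        # First write: the record itself.
        self.data_table[row_key] = dict(attributes)
        # Additional writes: one index entry per indexed attribute.
        for name, value in attributes.items():
            self.index_table[f"{name}={value}|{row_key}"] = row_key

    def find(self, name, value):
        """Look up row keys by attribute value (a prefix scan in a sorted store)."""
        prefix = f"{name}={value}|"
        return sorted(rk for key, rk in self.index_table.items()
                      if key.startswith(prefix))


store = ClientManagedIndex()
store.put("msg-0001", {"source": "sitst201", "corr_id": "000100E1A1EU42"})
store.put("msg-0002", {"source": "sitmt301", "corr_id": "09B5840E"})
print(store.find("source", "sitst201"))
```

Every indexed attribute costs one extra write per message, which is the trade-off of this approach.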


HBase - Coprocessors

 Run arbitrary code on any node:
 Observer: RegionObserver, MasterObserver, WALObserver
provide hooks for code execution
(prePut, postPut, preGet, postGet, ...)
 Endpoint: Installed on nodes, executed on client request
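The observer idea can be sketched as hooks that run before and after a write. The real coprocessor API is Java and differs in detail; in particular, letting a pre-hook veto a write by returning False is a simplification of this sketch, and all class names are hypothetical.

```python
class Observer:
    """Base class with no-op hooks, loosely modeled on prePut / postPut."""

    def pre_put(self, row_key, value):
        return True

    def post_put(self, row_key, value):
        pass


class Region:
    """Toy region that calls registered observers around each put."""

    def __init__(self):
        self.rows = {}
        self.observers = []

    def register(self, observer):
        self.observers.append(observer)

    def put(self, row_key, value):
        for observer in self.observers:
            # Simplification: a pre-hook vetoes the write by returning False.
            if observer.pre_put(row_key, value) is False:
                return False
        self.rows[row_key] = value
        for observer in self.observers:
            observer.post_put(row_key, value)
        return True


class RejectEmpty(Observer):
    """Pre-hook example: refuse empty messages."""

    def pre_put(self, row_key, value):
        return bool(value)


class AuditLog(Observer):
    """Post-hook example: remember which rows were written."""

    def __init__(self):
        self.seen = []

    def post_put(self, row_key, value):
        self.seen.append(row_key)


region = Region()
audit = AuditLog()
region.register(RejectEmpty())
region.register(audit)
region.put("msg-0001", "Message sent")
region.put("msg-0002", "")          # vetoed by RejectEmpty
print(audit.seen)
```

A post-put hook of this kind is the natural place to forward a freshly written message to an external system.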


Apache Solr

 Apache Lucene + many features like
 Distributed index
 Distributed search
 ...
 Apache Lucene is a high-performance, full-featured
text search engine library
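Full-text search engines of this kind are built around an inverted index that maps each term to the documents containing it. The following is a minimal sketch of that data structure only, not of Lucene or Solr; the tokenizer (runs of letters and digits, lowercased) and the AND-only query semantics are assumptions chosen for brevity.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[A-Za-z0-9]+")   # assumption: terms are alphanumeric runs


class InvertedIndex:
    """Toy inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for token in TOKEN.findall(text.lower()):
            self.postings[token].add(doc_id)

    def search(self, query):
        """Return ids of documents that contain every term of the query."""
        tokens = TOKEN.findall(query.lower())
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


index = InvertedIndex()
index.add(1, "Message sent [con=19104962, CorrID=000100E1A1EU42]")
index.add(2, "Query [SAP=1ASICDBETK, CorrID=000100E1A1EU42]")
print(index.search("CorrID 000100E1A1EU42"))
print(index.search("message sent"))
```

How well such tokenization works on XML, EDIFACT, or hex dumps depends heavily on the analyzer, which the sketch does not address.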


Solution - Details

Client -> HBase: insert log messages, create
secondary indexes for predefined attributes.

HBase -> Solr: use coprocessor functionality to index
log messages in Solr after insert.


Solution - Cluster Configuration
[Diagram: Client; ZooKeeper, NameNode, SecondaryNameNode and HMaster; three or more worker nodes, each running a DataNode, a RegionServer and a Solr instance]

Solution - HBase & MapReduce

 Very good integration of MapReduce into HBase
 Easy to use HBase as data source, data sink or both
 Provides helper classes
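Using a table as a MapReduce data source amounts to running a map function over every row and reducing the grouped output. The sketch below simulates that flow in memory and counts messages per source host; it is not Hadoop code, and the row layout, key names, and `run_job` helper are hypothetical.

```python
from collections import defaultdict


def map_row(row_key, row):
    """Map step: emit (source, 1) for every log message row."""
    yield row["source"], 1


def reduce_counts(key, values):
    """Reduce step: sum the counts for one source."""
    return key, sum(values)


def run_job(table, mapper, reducer):
    """Tiny in-memory stand-in for a job that uses a table as its input."""
    groups = defaultdict(list)
    for row_key, row in table.items():
        for key, value in mapper(row_key, row):
            groups[key].append(value)          # shuffle: group values by key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


log_table = {
    "msg-0001": {"source": "sitst201", "text": "Message sent"},
    "msg-0002": {"source": "sitst201", "text": "Query"},
    "msg-0003": {"source": "sitmt301", "text": "Message received"},
}
print(run_job(log_table, map_row, reduce_counts))
```

Writing the reducer output back into a table would make the same store act as data sink as well.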


Solution - Problems

 Can Solr keep up with HBase?
 Is Solr full text search practical for log messages?
(XML, other formats, ...)


Results

 Not many, yet.
 Generic experiments with random data
 Experiments with real log data just started


Results - Write Random Data - HBase Only
 Insert random data, 1 KB records.
 Cluster configuration:
 5 nodes:
 RAM: 24 GiB
 CPU: Intel Xeon L5520, 2.26 GHz
 HD: 2x 15k RPM SAS 73 GB (RAID 1)
 Node 1: Master (NameNode, HMaster, ZooKeeper)
 Nodes 2-5: Slaves (DataNode, RegionServer)
 Client on separate node
 Experiment executed with and without secondary
indexes (5 additional indexes).


Results - Write Random Data - HBase Only

Configuration           avg. inserts/sec
No secondary indexes    ~30'000
Secondary indexes       ~6'000 (not counting index inserts)
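The two throughput figures can be put side by side with a small calculation. It assumes that each of the 5 additional indexes costs exactly one extra put per message; the slides do not state this, so the total put rate below is an estimate, not a measured value.

```python
# Measured averages from the slides (approximate values).
BASELINE_INSERTS_PER_SEC = 30_000   # no secondary indexes
INDEXED_INSERTS_PER_SEC = 6_000     # with secondary indexes, index inserts not counted
ADDITIONAL_INDEXES = 5


def puts_per_message(indexes: int) -> int:
    """Assumption: one data put plus one put per secondary index."""
    return 1 + indexes


def total_puts_per_second(inserts_per_sec: int, indexes: int) -> int:
    return inserts_per_sec * puts_per_message(indexes)


slowdown = BASELINE_INSERTS_PER_SEC / INDEXED_INSERTS_PER_SEC
total = total_puts_per_second(INDEXED_INSERTS_PER_SEC, ADDITIONAL_INDEXES)
print(f"message throughput drops by a factor of {slowdown:.1f}")
print(f"estimated total puts/sec with indexes: {total}")
```

Under this assumption the cluster performs a comparable number of puts per second in both runs, so most of the drop in message throughput would be explained by the extra index writes.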


Results - Write Real Data - HBase & Solr

 No real numbers
 First tests: Single Solr instance indexes ~1000 log
messages per second.
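To relate this first figure to the question of whether Solr can keep up, the following sketch estimates how many Solr instances would be needed at a given message rate. It assumes indexing throughput scales linearly with the number of instances, which the first tests do not establish; the function name is illustrative.

```python
import math

SOLR_MESSAGES_PER_INSTANCE = 1_000   # first test result from the slides (approximate)


def solr_instances_needed(messages_per_sec: int,
                          per_instance: int = SOLR_MESSAGES_PER_INSTANCE) -> int:
    """Instances required under the (unverified) assumption of linear scaling."""
    return math.ceil(messages_per_sec / per_instance)


# Rates taken from the slides: HBase inserts with secondary indexes on the
# test cluster, and the production logging range.
for rate in (6_000, 100_000, 1_000_000):
    print(f"{rate:>9} msg/s -> {solr_instances_needed(rate)} Solr instances")
```

Sharding overhead and message format would both affect the real number.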


Questions


HBase - Architecture

Source: HBase - The Definitive Guide

HBase - Key Design

Source: HBase - The Definitive Guide

HBase - Hardware

 Master
 RAM: 24 GB
 CPU: Dual quad-core
 Disks: 4 x 1 TB SATA, RAID 0+1
 Slave
 RAM: 24 GB or more
 CPU: Dual quad-core
 Disks: 6 x 1 TB SATA, JBOD


HBase - Monitoring

 Ganglia is a scalable distributed monitoring system
for high-performance computing systems such as
clusters and Grids.
 HBase provides metrics for Ganglia.


Log Message Example (1)

2012/05/15 04:33:04.783757 sitst201 srvT2M-838059 Trace
name: all0302
Message sent [con=19104962 (FE_EXT_TCIL-ISO9735_ETK-
310_OPK2_ETK-REQ), cxn=1498840662
(172.17.39.174:13101), addr=0x1db58830, len=354,
CorrID=000100E1A1EU42,
MsgID=SQ8ZK36LG3TJ12JE6XMU2O8]
UNB^]IATB^_1^]1AETH^_^_LY^]CDBETICKET^_^_LY^]1205
15^_0433^]00JNQPH79K0001^]^]^]O^UNH^]1^]TKCREQ^
_08^_5^_1A^]000100E1A1EU42^DCX^]134^]<DCC
VERS="1.0"><MW><UKEY VAL="EXRU$3013#GJ12V4K#1IZ"
TRXNB="1"/><$



2012/05/15 04:33:04.783671 sitst201 srvT2M-838059 Trace
name: all0302
Query [SAP=1ASICDBETK, DCXID=EXRU$3013#GJ12V4K#1IZ,
TRXNB=1, CorrID=000100E1A1EU42,
MsgID=SQ8ZK36LG3TJ12JE6XMU2O8]



2012/05/15 04:32:42.289282 sitmt301 muxT2-332108 Trace
name: all0302
Message received [con=17697 (inSrvT2_TCIL_1),
cxn=1626671045 (194.156.170.210:8000),
addr=0x13e9b830, len=1710, CorrID=09B5840E,
MsgID=OX7E09RYABBLS61HR2DXTL]
+----- ADDR -----+--------------- HEX ---------------+----- ASCII ----+---- EBCDIC ----+
0000000013e9b830 554e421d 49415442 1f311d31 4153494c UNB.IATB.1.1ASIL .+.............<
0000000013e9b840 53533243 53544e1d 3141304c 53534352 SS2CSTN.1A0LSSCR ......+....<....
0000000013e9b850 591d3132 30353135 1f303433 321d3030 Y.120515.0432.00 ................
0000000013e9b860 39 ...
