Large Scale Log Analysis with HBase and
Solr at Amadeus
Martin Alig
aligma@student.ethz.ch
Overview
     Problem
     Solution - Overview
     HBase
     Solr
     Solution - Details
     Results




Monday, 16 July 2012
Problem

 Amadeus is the world's leading technology provider
  to the travel industry, providing marketing,
  distribution and IT services worldwide.
 The Amadeus computer reservation system (CRS)
  processed 850 million billable travel transactions in
  2010.
 The current logging framework produces 100'000 to
  1'000'000 messages per second.




Problem - Log Messages

 Messages have an average size of 1 KB.
 A message can be anything: XML, EDIFACT, hex
  dump, ...
 A few fixed attributes are given per message:
  timestamp, source, various IDs.
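
A rough estimate of the raw data rate implied by these figures (100'000 to 1'000'000 messages per second at about 1 KB each) can be sketched as follows. The sketch assumes 1 KB = 1000 bytes and a constant rate over a full day; neither assumption comes from the slides.

```python
# Back-of-the-envelope data volume for the logging framework.
# Assumptions (not from the slides): 1 KB = 1000 bytes, constant rate all day.

AVG_MESSAGE_BYTES = 1000
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_second(messages_per_second: int) -> int:
    """Raw log volume per second, ignoring compression and overhead."""
    return messages_per_second * AVG_MESSAGE_BYTES


def terabytes_per_day(messages_per_second: int) -> float:
    """Raw log volume per day in TB (10**12 bytes)."""
    return bytes_per_second(messages_per_second) * SECONDS_PER_DAY / 10**12


low, high = 100_000, 1_000_000
print(f"{bytes_per_second(low) / 10**6:.0f} MB/s to {bytes_per_second(high) / 10**9:.0f} GB/s")
print(f"{terabytes_per_day(low):.2f} TB/day to {terabytes_per_day(high):.1f} TB/day")
```

Under these assumptions the framework produces roughly 100 MB/s to 1 GB/s of uncompressed log data.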




Problem - Current Solution

 Write log messages to plain text files.
 Split, compress and copy them to the SAN.




      Queries? Search? Statistics?




Solution Overview

 Apache HBase for storage and instant random
  access.
 Apache Hadoop MapReduce for complex queries.
 Apache Solr as full-text search engine for queries on
  the log messages.




Apache HBase

 Open source, non-relational, distributed database.
 Modeled after Google's Bigtable.
 Runs on top of the Hadoop Distributed File System
  (HDFS).




HBase - Terms

 Region
        Contiguous ranges of rows stored together
        Dynamically split / merged and distributed
 RegionServer (slave)
        Serves regions, i.e. data for reads and writes
 HMaster (master)
        Responsible for coordination
        Assigns regions to RegionServers, detects failures
        Admin functions
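
Because each region holds a contiguous range of sorted row keys, finding the region for a row reduces to a search over the region start keys. The following is a minimal sketch of that idea, not the HBase client API; the start keys and server names are hypothetical.

```python
import bisect

# Each region covers the row-key range [start_key, next region's start_key).
# Start keys and server names below are made up for illustration.
REGION_START_KEYS = ["", "g", "n", "t"]          # sorted; "" = first region
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs1"]    # server hosting each region


def locate_region(row_key: str) -> int:
    """Return the index of the region whose key range contains row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before it is the one that holds the row.
    return bisect.bisect_right(REGION_START_KEYS, row_key) - 1


def locate_server(row_key: str) -> str:
    """Return the (hypothetical) server that hosts the row's region."""
    return REGION_SERVERS[locate_region(row_key)]


for key in ["apple", "golf", "zulu"]:
    print(key, "-> region", locate_region(key), "on", locate_server(key))
```

Splitting a region corresponds to inserting a new start key into the sorted list; lookups for all other regions are unaffected.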




HBase - Architecture

                         ZooKeeper
                                         HMaster
       Client            ZooKeeper
                                         HMaster
                         ZooKeeper




       RegionServer     RegionServer   RegionServer



                           HDFS



HBase - Data Access

     Java API
     REST
     Apache Avro, Apache Thrift
     Hadoop MapReduce




HBase - Secondary Indexes

 No native support for secondary indexes
 Different choices:
        Client managed: Write value in data table and index in
         index table
        Coprocessors that automatically create the secondary index
        Periodic update: Use MapReduce job to add index
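
The client-managed option can be illustrated with two in-memory dictionaries standing in for the data table and the index table. This is a sketch of the pattern only; the table layout and the attribute names (`source`, `corr_id`) are assumptions and do not come from the slides.

```python
from collections import defaultdict

# In-memory stand-ins for two HBase tables (hypothetical, for illustration).
data_table = {}                   # row key -> dict of attributes
index_table = defaultdict(set)    # (attribute, value) -> set of row keys

INDEXED_ATTRIBUTES = ("source", "corr_id")   # assumed attribute names


def put(row_key, attributes):
    """Client-managed indexing: write the row, then one index entry per
    indexed attribute. The two writes are not atomic, so a crash between
    them can leave the index stale."""
    data_table[row_key] = attributes
    for name in INDEXED_ATTRIBUTES:
        if name in attributes:
            index_table[(name, attributes[name])].add(row_key)


def find_by(name, value):
    """Look up rows through the index instead of scanning the data table."""
    return [data_table[k] for k in sorted(index_table.get((name, value), ()))]


put("row1", {"source": "sitst201", "corr_id": "000100E1A1EU42", "body": "..."})
put("row2", {"source": "sitmt301", "corr_id": "09B5840E", "body": "..."})
put("row3", {"source": "sitst201", "corr_id": "000100E1A1EU42", "body": "..."})

print(len(find_by("source", "sitst201")))
```

Every indexed attribute adds one extra write per message, which is the cost the client pays for fast lookups.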




HBase - Coprocessors

 Run arbitrary code on any node:
        Observer: RegionObserver, MasterObserver, WALObserver
         provide hooks for code execution
         (prePut, postPut, preGet, postGet, ...)
        Endpoint: Installed on nodes, executed on client request
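
The observer idea can be sketched as a toy region that calls hooks before and after each put. The class and method names below are hypothetical and only mimic the shape of hooks such as prePut and postPut; this is not the HBase coprocessor API.

```python
class Region:
    """Toy region that calls observer hooks around each put.
    This mimics the idea of a RegionObserver; it is not the HBase API."""

    def __init__(self):
        self.rows = {}
        self.observers = []

    def put(self, row_key, value):
        for obs in self.observers:
            obs.pre_put(row_key, value)
        self.rows[row_key] = value
        for obs in self.observers:
            obs.post_put(row_key, value)


class Observer:
    """Base observer with no-op hooks; subclasses override what they need."""

    def pre_put(self, row_key, value):
        pass

    def post_put(self, row_key, value):
        pass


class IndexingObserver(Observer):
    """Collects every stored row so it could be forwarded to a search index."""

    def __init__(self):
        self.indexed = []

    def post_put(self, row_key, value):
        self.indexed.append((row_key, value))


region = Region()
indexer = IndexingObserver()
region.observers.append(indexer)
region.put("row1", "Message sent ...")
print(indexer.indexed)
```

The client only calls `put`; the indexing happens on the server side as a consequence of the write.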




Apache Solr

 Apache Lucene + many features like
        Distributed index
        Distributed search
        ...
 Apache Lucene is a high-performance, full-featured
  text search engine library




Solution - Details

                                 Client        Insert log messages, create
                                               secondary indexes for
                                               predefined attributes.



                                 HBase


                        Use coprocessor functionality to index
                        log messages in Solr after insert.



                                  Solr


Solution - Cluster Configuration
              Client         Zookeeper           Namenode
                                             SecondaryNamenode
                                                  HMaster




           DataNode      DataNode                    DataNode
       RegionServer     RegionServer               RegionServer
                 Solr       Solr                       Solr
                                       ...
Solution - HBase & MapReduce

 Very good integration of MapReduce into HBase
 Easy to use HBase as data source, data sink or both
 Provides helper classes
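
To show what a "complex query" over the stored messages might look like, here is a pure-Python sketch of a map and reduce phase that counts messages per source. It imitates the programming model only and uses neither Hadoop nor the HBase helper classes; the row keys and attribute names are made up.

```python
from collections import defaultdict
from itertools import chain

# Rows as they might come out of a table scan: (row key, fixed attributes).
# Keys and attribute names are made up for this sketch.
rows = [
    ("r1", {"source": "sitst201", "level": "Trace"}),
    ("r2", {"source": "sitmt301", "level": "Trace"}),
    ("r3", {"source": "sitst201", "level": "Trace"}),
]


def map_row(row_key, attributes):
    """Map phase: emit (source, 1) for every log message."""
    yield attributes["source"], 1


def reduce_counts(key, values):
    """Reduce phase: sum the counts emitted for one source."""
    return key, sum(values)


def run_job(rows):
    # Shuffle: group all mapped values by key, as the framework would.
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(map_row(k, a) for k, a in rows):
        grouped[key].append(value)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


print(run_job(rows))
```

With HBase as data source, the framework would feed table rows to the map function; with HBase as data sink, the reduce output would be written back as rows.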




Solution - Problems

 Can Solr keep up with HBase?
 Is Solr full text search practical for log messages?
  (XML, other formats, ...)




Results

 Not many, yet.
 Generic experiments with random data
 Experiments with real log data just started




Results - Write Random Data - HBase Only
 Insert random data, 1 KB records.
 Cluster configuration:
        5 nodes:
               RAM: 24 GiB
               CPU: Intel Xeon L5520, 2.26 GHz
               HD: 2x 15k RPM SAS 73 GB (RAID 1)
        Node 1: master (NameNode, HMaster, ZooKeeper)
        Nodes 2-5: slaves (DataNode, RegionServer)
 Client on a separate node.
 Experiment executed with and without secondary
  indexes (5 additional indexes).


Results - Write Random Data - HBase Only

                   No secondary indexes   Secondary indexes
                   avg. inserts/sec       avg. inserts/sec (not counting
                                          index inserts)
                    ~30'000                ~6'000
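
One way to read the two numbers in the table is in terms of write amplification. The sketch below assumes that every message causes one data write plus one write per secondary index; the slides do not state how the index writes were issued, so this is an interpretation.

```python
# Measured averages from the table above (client-visible inserts per second).
INSERTS_NO_INDEX = 30_000
INSERTS_WITH_INDEX = 6_000
NUM_INDEXES = 5

# Assumption: every message causes one data write plus one write per index.
writes_per_message = 1 + NUM_INDEXES
total_writes_per_second = INSERTS_WITH_INDEX * writes_per_message

print(writes_per_message, total_writes_per_second)
print(total_writes_per_second / INSERTS_NO_INDEX)
```

Under that assumption the cluster performs about 36'000 writes per second with indexes, close to the ~30'000 measured without them, which would suggest that the drop is largely explained by the extra index writes.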




Results - Write Real Data - HBase & Solr

 No real numbers yet.
 First tests: a single Solr instance indexes ~1000 log
  messages per second.
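
The question of whether Solr can keep up with HBase can be framed as a simple capacity estimate. The sketch assumes that indexing throughput scales linearly with the number of Solr instances, which the first tests do not establish, and it compares against the HBase figure measured with random data.

```python
import math

SOLR_DOCS_PER_SECOND = 1_000      # first test result from the slide
HBASE_INSERTS_PER_SECOND = 6_000  # HBase result with secondary indexes


def solr_instances_needed(target_rate, per_instance_rate=SOLR_DOCS_PER_SECOND):
    """Instances required if indexing throughput scaled linearly (an assumption)."""
    return math.ceil(target_rate / per_instance_rate)


print(solr_instances_needed(HBASE_INSERTS_PER_SECOND))
print(solr_instances_needed(100_000))
```

Even under this optimistic assumption, matching the lower bound of the production message rate would take on the order of a hundred Solr instances.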




Questions




HBase - Architecture




                        Source: HBase - The Definitive Guide
HBase - Key Design




                        Source: HBase - The Definitive Guide
HBase - Hardware

 Master
        RAM: 24 GB
        CPU: Dual quad-core
        Disks: 4 x 1 TB SATA, RAID 0+1
 Slave
        RAM: 24 GB or more
        CPU: Dual quad-core
        Disks: 6 x 1 TB SATA, JBOD




HBase - Monitoring

 Ganglia is a scalable distributed monitoring system
  for high-performance computing systems such as
  clusters and Grids.
 HBase provides metrics for Ganglia.




Log Message Example (1)

      2012/05/15 04:33:04.783757 sitst201 srvT2M-838059 Trace
      name: all0302
      Message sent [con=19104962 (FE_EXT_TCIL-ISO9735_ETK-
      310_OPK2_ETK-REQ), cxn=1498840662
      (172.17.39.174:13101), addr=0x1db58830, len=354,
      CorrID=000100E1A1EU42,
      MsgID=SQ8ZK36LG3TJ12JE6XMU2O8]
      UNB^]IATB^_1^]1AETH^_^_LY^]CDBETICKET^_^_LY^]1205
      15^_0433^]00JNQPH79K0001^]^]^]O^UNH^]1^]TKCREQ^
      _08^_5^_1A^]000100E1A1EU42^DCX^]134^]<DCC
      VERS="1.0"><MW><UKEY VAL="EXRU$3013#GJ12V4K#1IZ"
      TRXNB="1"/><$



Log Message Example (2)

      2012/05/15 04:33:04.783671 sitst201 srvT2M-838059 Trace
      name: all0302
      Query [SAP=1ASICDBETK, DCXID=EXRU$3013#GJ12V4K#1IZ,
      TRXNB=1, CorrID=000100E1A1EU42,
      MsgID=SQ8ZK36LG3TJ12JE6XMU2O8]
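
The fixed attributes of such a message could be extracted before indexing roughly as follows. The header layout (timestamp, host, process, level) and the field names are inferred from this one example and are assumptions, not a documented format.

```python
import re
from datetime import datetime

# Example (2) from the slide, with the slide's line wrapping removed.
RAW = (
    "2012/05/15 04:33:04.783671 sitst201 srvT2M-838059 Trace name: all0302 "
    "Query [SAP=1ASICDBETK, DCXID=EXRU$3013#GJ12V4K#1IZ, TRXNB=1, "
    "CorrID=000100E1A1EU42, MsgID=SQ8ZK36LG3TJ12JE6XMU2O8]"
)

# Assumed layout: timestamp, host, process, level, then free text.
HEADER = re.compile(
    r"(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) "
    r"(?P<host>\S+) (?P<process>\S+) (?P<level>\S+) (?P<rest>.*)",
    re.DOTALL,
)
KEY_VALUE = re.compile(r"(\w+)=([^,\]\s]+)")


def parse(message):
    """Extract the fixed attributes and the key=value ids from one message."""
    match = HEADER.match(message)
    if match is None:
        raise ValueError("unrecognised log header")
    fields = match.groupdict()
    fields["ts"] = datetime.strptime(fields["ts"], "%Y/%m/%d %H:%M:%S.%f")
    # Only the first bracketed block holds the ids; the payload may follow it.
    bracket = re.search(r"\[(.*?)\]", fields.pop("rest"), re.DOTALL)
    fields["ids"] = dict(KEY_VALUE.findall(bracket.group(1))) if bracket else {}
    return fields


parsed = parse(RAW)
print(parsed["host"], parsed["ids"]["CorrID"])
```

Values such as `CorrID` and `MsgID` are natural candidates for secondary indexes, since they tie together the messages of one transaction.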




Log Message Example (3)

      2012/05/15 04:32:42.289282 sitmt301 muxT2-332108 Trace
      name: all0302
      Message received [con=17697 (inSrvT2_TCIL_1),
      cxn=1626671045 (194.156.170.210:8000),
      addr=0x13e9b830, len=1710, CorrID=09B5840E,
      MsgID=OX7E09RYABBLS61HR2DXTL]
      +----- ADDR -----+--------------- HEX ---------------+----- ASCII ----
      +---- EBCDIC ----+
       0000000013e9b830 554e421d 49415442 1f311d31
      4153494c UNB.IATB.1.1ASIL .+.............<
      0000000013e9b840 53533243 53544e1d 3141304c
      53534352 SS2CSTN.1A0LSSCR ......+....<....
      0000000013e9b850 591d3132 30353135 1f303433 321d3030
      Y.120515.0432.00 ................ 0000000013e9b860 39 ...
Montag, 16. Juli 2012                                                     30

Más contenido relacionado

Destacado

Pittaro open stackloganalysis_20130416
Pittaro open stackloganalysis_20130416Pittaro open stackloganalysis_20130416
Pittaro open stackloganalysis_20130416OpenStack Foundation
 
Solr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchSolr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchCloudera, Inc.
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoopgregchanan
 
Lily for the Bay Area HBase UG - NYC edition
Lily for the Bay Area HBase UG - NYC editionLily for the Bay Area HBase UG - NYC edition
Lily for the Bay Area HBase UG - NYC editionNGDATA
 
Large Scale Performance Monitoring for ElasticSearch, HBase, Solr, SenseiDB, ...
Large Scale Performance Monitoring for ElasticSearch, HBase, Solr, SenseiDB, ...Large Scale Performance Monitoring for ElasticSearch, HBase, Solr, SenseiDB, ...
Large Scale Performance Monitoring for ElasticSearch, HBase, Solr, SenseiDB, ...Sematext Group, Inc.
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing ArchitectureGang Tao
 
Rigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase PerformanceRigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase PerformanceCloudera, Inc.
 
NoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache HadoopNoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache HadoopDmitry Kan
 
STAC Summit 2014 - Building a multitenant Big Data infrastructure
STAC Summit 2014 - Building a multitenant Big Data infrastructureSTAC Summit 2014 - Building a multitenant Big Data infrastructure
STAC Summit 2014 - Building a multitenant Big Data infrastructureGord Sissons
 
Time-Series Apache HBase
Time-Series Apache HBaseTime-Series Apache HBase
Time-Series Apache HBaseHBaseCon
 
Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture Hortonworks
 
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrAndy Jackson
 
Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks
 
Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
Apache Phoenix and Apache HBase: An Enterprise Grade Data WarehouseApache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
Apache Phoenix and Apache HBase: An Enterprise Grade Data WarehouseJosh Elser
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadooplucenerevolution
 
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseEdureka!
 
Accelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cacheAccelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cacheDavid Grier
 
MongoDB and AWS Best Practices
MongoDB and AWS Best PracticesMongoDB and AWS Best Practices
MongoDB and AWS Best PracticesMongoDB
 

Destacado (20)

Pittaro open stackloganalysis_20130416
Pittaro open stackloganalysis_20130416Pittaro open stackloganalysis_20130416
Pittaro open stackloganalysis_20130416
 
Solr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchSolr+Hadoop = Big Data Search
Solr+Hadoop = Big Data Search
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
 
Lily for the Bay Area HBase UG - NYC edition
Lily for the Bay Area HBase UG - NYC editionLily for the Bay Area HBase UG - NYC edition
Lily for the Bay Area HBase UG - NYC edition
 
Large Scale Performance Monitoring for ElasticSearch, HBase, Solr, SenseiDB, ...
Large Scale Performance Monitoring for ElasticSearch, HBase, Solr, SenseiDB, ...Large Scale Performance Monitoring for ElasticSearch, HBase, Solr, SenseiDB, ...
Large Scale Performance Monitoring for ElasticSearch, HBase, Solr, SenseiDB, ...
 
Big Data Computing Architecture
Big Data Computing ArchitectureBig Data Computing Architecture
Big Data Computing Architecture
 
Rigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase PerformanceRigorous and Multi-tenant HBase Performance
Rigorous and Multi-tenant HBase Performance
 
NoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache HadoopNoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache Hadoop
 
STAC Summit 2014 - Building a multitenant Big Data infrastructure
STAC Summit 2014 - Building a multitenant Big Data infrastructureSTAC Summit 2014 - Building a multitenant Big Data infrastructure
STAC Summit 2014 - Building a multitenant Big Data infrastructure
 
NoSQL: Cassadra vs. HBase
NoSQL: Cassadra vs. HBaseNoSQL: Cassadra vs. HBase
NoSQL: Cassadra vs. HBase
 
Time-Series Apache HBase
Time-Series Apache HBaseTime-Series Apache HBase
Time-Series Apache HBase
 
Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture Delivering Apache Hadoop for the Modern Data Architecture
Delivering Apache Hadoop for the Modern Data Architecture
 
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices Workshop
 
Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
Apache Phoenix and Apache HBase: An Enterprise Grade Data WarehouseApache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
Apache Phoenix and Apache HBase: An Enterprise Grade Data Warehouse
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
 
Accelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cacheAccelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cache
 
MongoDB and AWS Best Practices
MongoDB and AWS Best PracticesMongoDB and AWS Best Practices
MongoDB and AWS Best Practices
 

Similar a Large Scale Log Analysis with HBase and Solr at Amadeus (Martin Alig, ETH Zurich)

Why Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data WorldWhy Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data WorldDean Wampler
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionCloudera, Inc.
 
Hadoop file system
Hadoop file systemHadoop file system
Hadoop file systemJohn Veigas
 
Hadoop, SQL & NoSQL: No Longer an Either-or Question
Hadoop, SQL & NoSQL: No Longer an Either-or QuestionHadoop, SQL & NoSQL: No Longer an Either-or Question
Hadoop, SQL & NoSQL: No Longer an Either-or QuestionTony Baer
 
Hadoop, SQL and NoSQL, No longer an either/or question
Hadoop, SQL and NoSQL, No longer an either/or questionHadoop, SQL and NoSQL, No longer an either/or question
Hadoop, SQL and NoSQL, No longer an either/or questionDataWorks Summit
 
Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...
Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...
Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...Christian Gügi
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1Thanh Nguyen
 
Node Js, AngularJs and Express Js Tutorial
Node Js, AngularJs and Express Js TutorialNode Js, AngularJs and Express Js Tutorial
Node Js, AngularJs and Express Js TutorialPHP Support
 
mar07-redis.pdf
mar07-redis.pdfmar07-redis.pdf
mar07-redis.pdfAnisSalhi3
 
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Jeffrey Breen
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data trainingagiamas
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingCloudera, Inc.
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Alluxio, Inc.
 
MongoDB Developer's Notebook, March 2016 -- MongoDB Connector for Business In...
MongoDB Developer's Notebook, March 2016 -- MongoDB Connector for Business In...MongoDB Developer's Notebook, March 2016 -- MongoDB Connector for Business In...
MongoDB Developer's Notebook, March 2016 -- MongoDB Connector for Business In...Daniel M. Farrell
 
Mdb dn 2016_07_elastic_search
Mdb dn 2016_07_elastic_searchMdb dn 2016_07_elastic_search
Mdb dn 2016_07_elastic_searchDaniel M. Farrell
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 

Similar a Large Scale Log Analysis with HBase and Solr at Amadeus (Martin Alig, ETH Zurich) (20)

Big data overview by Edgars
Big data overview by EdgarsBig data overview by Edgars
Big data overview by Edgars
 
Why Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data WorldWhy Scala Is Taking Over the Big Data World
Why Scala Is Taking Over the Big Data World
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Hadoop file system
Hadoop file systemHadoop file system
Hadoop file system
 
Apache drill
Apache drillApache drill
Apache drill
 
Hadoop, SQL & NoSQL: No Longer an Either-or Question
Hadoop, SQL & NoSQL: No Longer an Either-or QuestionHadoop, SQL & NoSQL: No Longer an Either-or Question
Hadoop, SQL & NoSQL: No Longer an Either-or Question
 
Hadoop, SQL and NoSQL, No longer an either/or question
Hadoop, SQL and NoSQL, No longer an either/or questionHadoop, SQL and NoSQL, No longer an either/or question
Hadoop, SQL and NoSQL, No longer an either/or question
 
RDBMS vs NoSQL
RDBMS vs NoSQLRDBMS vs NoSQL
RDBMS vs NoSQL
 
Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...
Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...
Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...
 
Final proj
Final projFinal proj
Final proj
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 
Node Js, AngularJs and Express Js Tutorial
Node Js, AngularJs and Express Js TutorialNode Js, AngularJs and Express Js Tutorial
Node Js, AngularJs and Express Js Tutorial
 
mar07-redis.pdf
mar07-redis.pdfmar07-redis.pdf
mar07-redis.pdf
 
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data training
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data Processing
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
 
MongoDB Developer's Notebook, March 2016 -- MongoDB Connector for Business In...
MongoDB Developer's Notebook, March 2016 -- MongoDB Connector for Business In...MongoDB Developer's Notebook, March 2016 -- MongoDB Connector for Business In...
MongoDB Developer's Notebook, March 2016 -- MongoDB Connector for Business In...
 
Mdb dn 2016_07_elastic_search
Mdb dn 2016_07_elastic_searchMdb dn 2016_07_elastic_search
Mdb dn 2016_07_elastic_search
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 

Más de Swiss Big Data User Group

Making Hadoop based analytics simple for everyone to use
Making Hadoop based analytics simple for everyone to useMaking Hadoop based analytics simple for everyone to use
Making Hadoop based analytics simple for everyone to useSwiss Big Data User Group
 
A real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operatorA real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operatorSwiss Big Data User Group
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Closing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data AnalysisClosing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data AnalysisSwiss Big Data User Group
 
Big Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companiesBig Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companiesSwiss Big Data User Group
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningSwiss Big Data User Group
 
Unleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data WarehouseUnleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data WarehouseSwiss Big Data User Group
 
Project "Babelfish" - A data warehouse to attack complexity
 Project "Babelfish" - A data warehouse to attack complexity Project "Babelfish" - A data warehouse to attack complexity
Project "Babelfish" - A data warehouse to attack complexitySwiss Big Data User Group
 
Brainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density ChoiceBrainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density ChoiceSwiss Big Data User Group
 
Urturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maketUrturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maketSwiss Big Data User Group
 
The World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC DatagridThe World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC DatagridSwiss Big Data User Group
 
New opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseNew opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseSwiss Big Data User Group
 
Technology Outlook - The new Era of computing
Technology Outlook - The new Era of computingTechnology Outlook - The new Era of computing
Technology Outlook - The new Era of computingSwiss Big Data User Group
 

Más de Swiss Big Data User Group (20)

Making Hadoop based analytics simple for everyone to use
Making Hadoop based analytics simple for everyone to useMaking Hadoop based analytics simple for everyone to use
Making Hadoop based analytics simple for everyone to use
 
A real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operatorA real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operator
 
Data Analytics – B2B vs. B2C
Data Analytics – B2B vs. B2CData Analytics – B2B vs. B2C
Data Analytics – B2B vs. B2C
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Closing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data AnalysisClosing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data Analysis
 
Big Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companiesBig Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companies
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
 
Educating Data Scientists of the Future
Educating Data Scientists of the FutureEducating Data Scientists of the Future
Educating Data Scientists of the Future
 
Unleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data WarehouseUnleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data Warehouse
 
Big data for Telco: opportunity or threat?
Big data for Telco: opportunity or threat?Big data for Telco: opportunity or threat?
Big data for Telco: opportunity or threat?
 
Project "Babelfish" - A data warehouse to attack complexity
 Project "Babelfish" - A data warehouse to attack complexity Project "Babelfish" - A data warehouse to attack complexity
Project "Babelfish" - A data warehouse to attack complexity
 
Brainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density ChoiceBrainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density Choice
 
Urturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maketUrturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maket
 
The World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC DatagridThe World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC Datagrid
 
New opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseNew opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph database
 
Technology Outlook - The new Era of computing
Technology Outlook - The new Era of computingTechnology Outlook - The new Era of computing
Technology Outlook - The new Era of computing
 
In-Store Analysis with Hadoop
In-Store Analysis with HadoopIn-Store Analysis with Hadoop
In-Store Analysis with Hadoop
 
Big Data Visualization With ParaView
Big Data Visualization With ParaViewBig Data Visualization With ParaView
Big Data Visualization With ParaView
 
Introduction to Apache Drill
Introduction to Apache DrillIntroduction to Apache Drill
Introduction to Apache Drill
 

Large Scale Log Analysis with HBase and Solr at Amadeus (Martin Alig, ETH Zurich)

  • 1. Large Scale Log Analysis with HBase and Solr at Amadeus. Martin Alig, aligma@student.ethz.ch
  • 2. Overview: Problem; Solution - Overview; HBase; Solr; Solution - Details; Results. Monday, 16 July 2012.
  • 3. Problem: Amadeus is the world's leading technology provider to the travel industry, providing marketing, distribution and IT services worldwide. The Amadeus computer reservation system (CRS) processed 850 million billable travel transactions in 2010. The current logging framework produces 100'000 to 1'000'000 messages per second.
  • 4. Problem - Log Messages: Messages have an average size of 1 KB. A message can be anything: XML, EDIFACT, hex dump, ... A few fixed attributes are given per message: timestamp, source, various IDs.
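The message rate and average size quoted above imply a raw data volume that can be estimated with simple arithmetic. The sketch below assumes a decimal kilobyte (1 KB = 1000 bytes) and a constant rate over a full day; both are simplifications for illustration and are not stated in the slides.

```python
# Back-of-the-envelope log volume from the figures on the slides.
MESSAGE_SIZE_BYTES = 1000        # "1 KB average size"; decimal KB is an assumption
SECONDS_PER_DAY = 24 * 60 * 60   # assumes the rate is sustained all day


def volume(messages_per_second: int) -> dict:
    """Return the raw (uncompressed) volume for a given message rate."""
    bytes_per_second = messages_per_second * MESSAGE_SIZE_BYTES
    return {
        "MB_per_second": bytes_per_second / 1e6,
        "TB_per_day": bytes_per_second * SECONDS_PER_DAY / 1e12,
    }


low = volume(100_000)      # lower bound of the quoted rate
high = volume(1_000_000)   # upper bound of the quoted rate

if __name__ == "__main__":
    print("low: ", low)
    print("high:", high)
```

Under these assumptions the range is roughly 100 MB/s to 1 GB/s, or about 8.6 TB to 86 TB per day before compression and replication.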
  • 5. Problem - Current Solution: Write log messages to plain text files. Split, compress and copy them to the SAN. Queries? Search? Statistics?
  • 6. Solution Overview: Apache HBase for storage and instant random access. Apache Hadoop MapReduce for complex queries. Apache Solr as a full-text search engine for queries on the log messages.
  • 7. Apache HBase: Open-source, non-relational, distributed database. Modeled after Google's Bigtable. Runs on top of the Hadoop Distributed File System (HDFS).
  • 8. HBase - Terms: Region: contiguous ranges of rows stored together; dynamically split / merged and distributed. RegionServer (slave): serves the data of its regions for reads and writes. HMaster (master): responsible for coordination; assigns regions to RegionServers and detects failures; admin functions.
  • 9. HBase - Architecture (diagram). Components shown: Client; ZooKeeper (x3); HMaster (x2); RegionServer (x3); HDFS.
  • 10. HBase - Data Access: Java API; REST; Apache Avro, Apache Thrift; Hadoop MapReduce.
  • 11. HBase - Secondary Indexes: No native support for secondary indexes. Different choices: (1) client-managed: write the value to the data table and the index entry to an index table; (2) coprocessors that automatically create the secondary index; (3) periodic update: use a MapReduce job to add the index.
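The client-managed option can be illustrated with a small in-memory model. This is a minimal sketch, not HBase client code: plain dictionaries stand in for the data table and the index tables, and the class name `LogStore`, the attribute names and the row keys are hypothetical.

```python
from collections import defaultdict


class LogStore:
    """Toy stand-in for a data table plus one index table per attribute."""

    def __init__(self, indexed_attributes):
        self.data = {}  # row key -> record
        # attribute -> (attribute value -> set of row keys)
        self.indexes = {attr: defaultdict(set) for attr in indexed_attributes}

    def _unindex(self, row_key, record):
        # Remove stale index entries before a row is overwritten.
        for attr, index in self.indexes.items():
            if attr in record:
                index[record[attr]].discard(row_key)

    def put(self, row_key, record):
        # The client performs both writes: data first, then every index.
        old = self.data.get(row_key)
        if old is not None:
            self._unindex(row_key, old)
        self.data[row_key] = record
        for attr, index in self.indexes.items():
            if attr in record:
                index[record[attr]].add(row_key)

    def find(self, attr, value):
        # Look up row keys in the index, then fetch the rows themselves.
        row_keys = sorted(self.indexes[attr].get(value, ()))
        return [self.data[key] for key in row_keys]


store = LogStore(["corr_id", "source"])
store.put("20120515043304783757",
          {"source": "sitst201", "corr_id": "000100E1A1EU42", "body": "Message sent"})
store.put("20120515043304783671",
          {"source": "sitst201", "corr_id": "000100E1A1EU42", "body": "Query"})
store.put("20120515043242289282",
          {"source": "sitmt301", "corr_id": "09B5840E", "body": "Message received"})

matches = store.find("corr_id", "000100E1A1EU42")
```

In this model the data write and the index writes are separate steps, which mirrors the main drawback of the client-managed approach: if the client fails between the two, the index and the data can disagree.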
  • 12. HBase - Coprocessors: Run arbitrary code on any node. Observer: RegionObserver, MasterObserver and WALObserver provide hooks for code execution (prePut, postPut, preGet, postGet, ...). Endpoint: installed on nodes, executed on client request.
  • 13. Apache Solr: Apache Lucene plus many features such as distributed index, distributed search, ... Apache Lucene is a high-performance, full-featured text search engine library.
  • 14. Solution - Details (diagram). Client to HBase: insert log messages, create secondary indexes for predefined attributes. HBase to Solr: use coprocessor functionality to index log messages in Solr after insert.
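The "index after insert" flow can be sketched as a hook that runs after every write. The example below is a simplified model under stated assumptions: `Table` imitates an observer-style post-put hook and `InvertedIndex` imitates the search side, but neither uses the HBase coprocessor API or the Solr API, and all names are hypothetical.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Toy stand-in for the full-text index: token -> set of row keys."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, row_key, text):
        # Lower-case and split on non-alphanumeric characters.
        for token in re.findall(r"[A-Za-z0-9_]+", text.lower()):
            self.postings[token].add(row_key)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), ()))


class Table:
    """Toy stand-in for a table that calls hooks after each put."""

    def __init__(self):
        self.rows = {}
        self.post_put_hooks = []

    def put(self, row_key, message):
        self.rows[row_key] = message
        # Hooks run only after the row has been stored, like a postPut hook.
        for hook in self.post_put_hooks:
            hook(row_key, message)


index = InvertedIndex()
table = Table()
table.post_put_hooks.append(index.add)

table.put("row-1", "Message sent CorrID=000100E1A1EU42")
table.put("row-2", "Message received CorrID=09B5840E")
```

The client only talks to the table; indexing happens as a side effect of the write, which is the property the coprocessor-based design relies on.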
  • 15. Solution - Cluster Configuration (diagram). Components shown: Client; ZooKeeper; NameNode; SecondaryNameNode; HMaster; DataNode (x3); RegionServer (x3); Solr (x3); ...
  • 16. Solution - HBase & MapReduce: Very good integration of MapReduce into HBase. Easy to use HBase as data source, data sink or both. Provides helper classes.
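To make the "table as data source" idea concrete, here is a minimal sketch of a MapReduce-style job that counts log messages per source. It runs in plain Python over a dictionary and does not use Hadoop or the HBase helper classes; the function names and the sample rows are hypothetical.

```python
from collections import defaultdict


def map_row(row_key, record):
    """Map phase: emit (source, 1) for every stored log message."""
    yield record["source"], 1


def reduce_counts(key, values):
    """Reduce phase: sum the counts emitted for one key."""
    return key, sum(values)


def run_job(table):
    # Shuffle: group all emitted values by key before reducing.
    groups = defaultdict(list)
    for row_key, record in table.items():
        for key, value in map_row(row_key, record):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in sorted(groups.items()))


rows = {
    "row-1": {"source": "sitst201", "body": "Message sent"},
    "row-2": {"source": "sitst201", "body": "Query"},
    "row-3": {"source": "sitmt301", "body": "Message received"},
}

counts = run_job(rows)
```

In this sketch the table is only the data source; writing `counts` back into another table would make it the sink as well.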
  • 17. Solution - Problems: Can Solr keep up with HBase? Is Solr full-text search practical for log messages (XML, other formats, ...)?
  • 18. Results: Not many, yet. Generic experiments with random data. Experiments with real log data have just started.
  • 19. Results - Write Random Data - HBase Only: Insert random data, 1 KB records. Cluster configuration: 5 nodes, each with RAM: 24 GiB; CPU: Intel Xeon L5520 2.26 GHz; HD: 2 x 15k RPM SAS 73 GB (RAID 1). Node 1: master (NameNode, HMaster, ZooKeeper). Nodes 2-5: slaves (DataNode, RegionServer). Client on a separate node. Experiment executed with and without secondary indexes (5 additional indexes).
  • 20. Results - Write Random Data - HBase Only: No secondary indexes: ~30'000 avg. inserts/sec. Secondary indexes: ~6'000 avg. inserts/sec (not counting index inserts).
  • 21. Results - Write Read Data - HBase & Solr: No real numbers. First tests: a single Solr instance indexes ~1000 log messages per second.
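The figures on slides 19 to 21 can be put side by side with a little arithmetic. The sketch below rests on two assumptions that the slides do not confirm: that each of the 5 secondary indexes costs one extra write per message, and that Solr indexing throughput scales linearly with the number of instances.

```python
import math

HBASE_INSERTS_PER_SEC = 30_000          # slide 20, no secondary indexes
HBASE_INDEXED_INSERTS_PER_SEC = 6_000   # slide 20, with secondary indexes
ADDITIONAL_INDEXES = 5                  # slide 19
SOLR_DOCS_PER_SEC = 1_000               # slide 21, single Solr instance

# Assumption: one data write plus one write per index for every message.
total_writes_per_sec = HBASE_INDEXED_INSERTS_PER_SEC * (1 + ADDITIONAL_INDEXES)


def solr_instances_needed(messages_per_sec: int) -> int:
    """Instances required if throughput scaled linearly (an assumption)."""
    return math.ceil(messages_per_sec / SOLR_DOCS_PER_SEC)


if __name__ == "__main__":
    print("total writes/sec with indexes:", total_writes_per_sec)
    print("Solr instances for unindexed HBase rate:",
          solr_instances_needed(HBASE_INSERTS_PER_SEC))
    print("Solr instances for indexed HBase rate:",
          solr_instances_needed(HBASE_INDEXED_INSERTS_PER_SEC))
```

Under the first assumption, ~6'000 indexed inserts correspond to ~36'000 writes per second in total, the same order of magnitude as the ~30'000 unindexed inserts. Under the second, matching the unindexed HBase rate would take about 30 Solr instances, which gives a sense of scale for the question of whether Solr can keep up.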
  • 24. HBase - Architecture (figure). Source: HBase - The Definitive Guide.
  • 25. HBase - Key Design (figure). Source: HBase - The Definitive Guide.
  • 26. HBase - Hardware: Master: RAM: 24 GB; CPU: dual quad-core; disks: 4 x 1 TB SATA, RAID 0+1. Slave: RAM: 24 GB or more; CPU: dual quad-core; disks: 6 + 1 TB SATA, JBOD.
  • 27. HBase - Monitoring: Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and grids. HBase provides metrics for Ganglia.
  • 28. Log Message Example (1): 2012/05/15 04:33:04.783757 sitst201 srvT2M-838059 Trace name: all0302 Message sent [con=19104962 (FE_EXT_TCIL-ISO9735_ETK- 310_OPK2_ETK-REQ), cxn=1498840662 (172.17.39.174:13101), addr=0x1db58830, len=354, CorrID=000100E1A1EU42, MsgID=SQ8ZK36LG3TJ12JE6XMU2O8] UNB^]IATB^_1^]1AETH^_^_LY^]CDBETICKET^_^_LY^]1205 15^_0433^]00JNQPH79K0001^]^]^]O^UNH^]1^]TKCREQ^ _08^_5^_1A^]000100E1A1EU42^DCX^]134^]<DCC VERS="1.0"><MW><UKEY VAL="EXRU$3013#GJ12V4K#1IZ" TRXNB="1"/><$
  • 29. Log Message Example (2): 2012/05/15 04:33:04.783671 sitst201 srvT2M-838059 Trace name: all0302 Query [SAP=1ASICDBETK, DCXID=EXRU$3013#GJ12V4K#1IZ, TRXNB=1, CorrID=000100E1A1EU42, MsgID=SQ8ZK36LG3TJ12JE6XMU2O8]
  • 30. Log Message Example (3): 2012/05/15 04:32:42.289282 sitmt301 muxT2-332108 Trace name: all0302 Message received [con=17697 (inSrvT2_TCIL_1), cxn=1626671045 (194.156.170.210:8000), addr=0x13e9b830, len=1710, CorrID=09B5840E, MsgID=OX7E09RYABBLS61HR2DXTL]
      +----- ADDR -----+--------------- HEX ---------------+----- ASCII ---- +---- EBCDIC ----+
      0000000013e9b830 554e421d 49415442 1f311d31 4153494c UNB.IATB.1.1ASIL .+.............<
      0000000013e9b840 53533243 53544e1d 3141304c 53534352 SS2CSTN.1A0LSSCR ......+....<....
      0000000013e9b850 591d3132 30353135 1f303433 321d3030 Y.120515.0432.00 ................
      0000000013e9b860 39 ...
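The log message examples above share a common header, which is where the fixed attributes (timestamp, source, various IDs) that feed the secondary indexes would come from. The sketch below parses the header of Example (2). The field names `source`, `process`, `trace` and `event` are guesses from the sample lines rather than names given in the slides, and the pattern is only checked against this one example.

```python
import re
from datetime import datetime

# Assumed header layout, inferred from the sample lines:
# <date> <time> <source> <process> Trace name: <trace> <event> [<k=v, ...>]
HEADER = re.compile(
    r"(?P<timestamp>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) "
    r"(?P<source>\S+) (?P<process>\S+) "
    r"Trace name: (?P<trace>\S+) "
    r"(?P<event>[^\[]+?) \[(?P<attributes>[^\]]*)\]"
)


def parse_header(line: str) -> dict:
    """Extract the fixed attributes from the start of a log message."""
    match = HEADER.match(line)
    if match is None:
        raise ValueError("unrecognized log header")
    fields = match.groupdict()
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%Y/%m/%d %H:%M:%S.%f"
    )
    # Split "key=value" pairs; values may themselves contain spaces.
    attributes = {}
    for pair in fields["attributes"].split(", "):
        key, _, value = pair.partition("=")
        attributes[key] = value
    fields["attributes"] = attributes
    return fields


EXAMPLE_2 = (
    "2012/05/15 04:33:04.783671 sitst201 srvT2M-838059 Trace name: all0302 "
    "Query [SAP=1ASICDBETK, DCXID=EXRU$3013#GJ12V4K#1IZ, TRXNB=1, "
    "CorrID=000100E1A1EU42, MsgID=SQ8ZK36LG3TJ12JE6XMU2O8]"
)

parsed = parse_header(EXAMPLE_2)
```

Values such as `parsed["attributes"]["CorrID"]` are natural candidates for secondary index keys, while the free-form payload after the header is what would go to full-text search.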