The document covers capacity planning and performance tuning for Hadoop big data systems. It opens with an agenda: why capacity planners need to prepare for Hadoop, an overview of the Hadoop ecosystem, capacity planning and performance tuning of Hadoop, how to get started, and the importance of measurement. It then walks through the main components of the Hadoop ecosystem and gives guidance on analyzing the different types of workloads and components.
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity Planner
1. moviri.com
Hitchhiker’s guide for the Capacity Planner
Connecticut Computer Measurement Group
Cromwell CT – April 2015
Renato Bonomini renato.bonomini@moviri.com
Capacity Management and BigData
2. 2
Agenda
● Why do we as Capacity Planners need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
3. Brought to you by…
Renato Bonomini - Lead of US operations for Moviri - @renatobonomini
Mattia Berlusconi - Capacity Management Consultant
Giulia Rumi - Capacity Management Analyst
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 3
4. 4
Agenda
● Why do we as Capacity Planners need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
5. Handling large amounts of data? High Performance Computing?
Is it new? Where does it come from? Why do I have to listen to this?
5
Cray 1, 80 MFLOPS, 1975
[A bunch of engineers on a field trip in Silicon Valley, Renato]
IBM 350, 3.56 MB, 1956
[Wikipedia]
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
6. “When will computer hardware match the human brain?”
Hans Moravec, Robotics Institute, Carnegie Mellon University
The need for Analytics: the new “machine revolution”
6
http://www.transhumanist.com/volume1/moravec.htm
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
7. 1964: Isaac Asimov on the 2014 World’s Fair (February 2015 blog post)
“The world of A.D. 2014 will have
few routine jobs that cannot be
done better by some machine than
by any human being.
Mankind will therefore have
become largely a race of machine
tenders.”
“When will computer hardware match the human brain?”
7
http://reuvengorsht.com/2015/02/07/machines-replace-middle-management
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
8. “Map and Reduce” / “Divide et Impera”
• Julius Caesar arrives in Alexandria after
defeating the Egyptian army and enters the
Ancient Library
• Surprise: there are millions of volumes in the library; how many of them are in Latin?
• Caesar arranges a Centuria (80 soldiers): each soldier inspects a batch of books and reports to his Centurion the number of pages written in Latin in each book
• The Centurion writes on a tabula the count from each soldier; when finished, he sums the parts up
All I need to know I learned from Rome
8
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
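To make the analogy concrete, here is a minimal, self-contained Python sketch of the Centuria's counting expressed as map, shuffle, and reduce steps; the data and function names are illustrative, not part of any Hadoop API.

    from collections import defaultdict

    # Map: each "soldier" inspects one book and emits (language, pages) pairs.
    def map_book(book):
        for language, pages in book["pages_by_language"].items():
            yield language, pages

    # Shuffle: group intermediate pairs by key, as the framework would do.
    def shuffle(pairs):
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    # Reduce: the "Centurion" sums the partial counts for each key.
    def reduce_counts(key, values):
        return key, sum(values)

    library = [
        {"title": "De Bello Gallico", "pages_by_language": {"latin": 80}},
        {"title": "Odyssey", "pages_by_language": {"greek": 120}},
        {"title": "Anthology", "pages_by_language": {"latin": 10, "greek": 40}},
    ]
    intermediate = [pair for book in library for pair in map_book(book)]
    totals = dict(reduce_counts(k, v) for k, v in shuffle(intermediate).items())
    print(totals)  # {'latin': 90, 'greek': 160}

The same three steps - map in parallel, group by key, reduce per key - are exactly what a MapReduce framework runs across a cluster instead of inside a single process.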
9. “Map and Reduce” vs. Message Passing Interface
So “Map and Reduce” was a revolution? Yes, in one sense - which one?
9
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
MPI tutorial, Blaise Barney, Lawrence Livermore National Laboratory
MPI: C, Fortran
MapReduce: Java, Python
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
10. 1. MapReduce makes technologies available to a wide audience
We saw that MPI already handled similar use cases, but it was mostly restricted to university research and large R&D facilities
2. Reliability and commodity hardware at its base
3. It moves the needle on how to handle large amounts of data
Database: organize first, then load
Hadoop: load first, then organize
What are the revolutions brought by MapReduce and BigData?
10 Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
11. 11
Agenda
● Why do we as Capacity Planners need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
12. ● The “Hardware” layer contains libraries and utilities, stores data, and supports job execution
● HDFS is the fault-tolerant, replicated distributed file system
● YARN (Yet Another Resource Negotiator) supports several programming models that can co-exist in the cluster; MapReduce is only one of them
● The Application layer is composed of several frameworks, among which Pig and Hive are the most used.
Hadoop workflow (a minimal client sketch follows the layer table below)
● the client breaks data into small chunks to be loaded onto different DataNodes
● for each data block, the client contacts the NameNode, which answers with a sorted list of three DataNodes (every block is replicated on more than one machine)
● the client writes each block directly to the first DataNode, which then replicates the data onto the other two nodes
The most famous open-source implementation of a MapReduce
framework is Apache Hadoop
12
Optimization Techniques within the Hadoop Eco-system: a Survey
Giulia Rumi, Claudia Colella, Danilo Ardagna
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
LAYERS               HADOOP 1.X             HADOOP 2.X
Users /
Application layer    Hive/Pig               Hive/Pig
Programming Models   Hadoop 1.X MapReduce   MapReduce (one of several)
Resource Management  (part of MapReduce)    YARN
File system          HDFS                   HDFS
Hardware             commodity hardware     commodity hardware
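As a concrete illustration of the write workflow above, here is a hedged sketch using the WebHDFS REST API (available in Hadoop 2.x): the client first asks the NameNode where to write and is redirected to a DataNode, mirroring the "sorted list of DataNodes" step. Host names, the port, the user, and the path are assumptions for your cluster.

    import requests

    # Assumed endpoints: adjust host, port (50070 is the Hadoop 2.x default
    # NameNode HTTP port), user, and path for your cluster.
    NAMENODE = "http://namenode.example.com:50070"
    PATH = "/user/capacity/demo.txt"

    # Step 1: ask the NameNode where to write. It does not take the data;
    # it answers with a 307 redirect to a DataNode, mirroring the
    # "sorted list of DataNodes" step in the workflow above.
    r = requests.put(
        NAMENODE + "/webhdfs/v1" + PATH,
        params={"op": "CREATE", "user.name": "hdfs", "overwrite": "true"},
        allow_redirects=False,
    )
    datanode_url = r.headers["Location"]

    # Step 2: write the data directly to the chosen DataNode; that node
    # then pipelines the replicas to the other DataNodes.
    r = requests.put(datanode_url, data=b"hello hadoop\n")
    print(r.status_code)  # 201 Created on success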
14. Geek Fun
“A DBA walks into a NOSQL bar,
but turns and leaves
because he couldn't find a table”
(webtonull)
15. ● HDFS (Hadoop distributed filesystem) is
where the Hadoop cluster stores data
● YARN is the architectural center of Hadoop
that allows multiple data processing engines
● MapReduce is a programming paradigm
● Hive provides a warehouse structure and
SQL-like access for data in HDFS
● Pig is a high-level data-flow language
● HBase is an open-source, distributed,
versioned, column-oriented store that sits
on top of HDFS.
• Apache Spark is an open-source big data
real-time processing framework
• ZooKeeper is an open source Apache project
that provides a centralized infrastructure and
services that enable synchronization across a
cluster
• Apache Cassandra is an open source
distributed database management system
designed to handle large amounts of data
across many commodity servers
• Solr is an open-source enterprise search
platform from the Apache Lucene project. It
provides full-text search, hit highlighting,
faceted search, dynamic clustering, database
integration and rich document handling.
We are going to focus on a few specific “animals” of this zoo
15 Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
16. 16
Agenda
● Why do we as Capacity Planners need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
17. “we’ll start the new Hadoop cluster with 500 TB and then we’ll see how much we need”
A real conversation at a customer
Why do you need to get on board soon?
There are significant resources at stake and large areas for improvement
● Significant investments are being directed towards these
initiatives
● They are complex and large, with hundreds of configuration parameters: a little help from an experienced capacity planner can save a lot of money
17 Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
18. ● Shouldn’t the ‘Hadoop user/owner’ take care of this?
Distributed machine learning is still an active research topic; it is related to both machine learning and systems
While Hadoop users don’t develop systems, they need to know how to choose them. An important fact is that existing distributed systems and parallel frameworks are not specifically designed for machine learning algorithms
● Hadoop users can
help to affect how systems are designed
design new algorithms for existing systems
Role of the Capacity Planner and Performance Analyst
18 Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
20. ● Scheduling is one of the most important tasks in a multi-concurrent-task system: see the research from our colleague Giulia (and others), “Optimization Techniques within the Hadoop Eco-system: a Survey” [DOI: 10.1109/SYNASC.2014.65]
● This illustrates the typical optimization problems:
data locality
sticky slots problems
poor system utilization because of suboptimal distribution
of tasks
unbalanced jobs
starvation and even fairness (be fair to your users)
● There are hundreds of configuration variables available to the end-user: rule of thumb vs. optimal configuration can make a big difference
Current performance tuning opportunities:
Scheduling
20 Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
21. ● Other initiatives
Starfish http://www.cs.duke.edu/starfish/index.html [Towards Automatic Optimization of MapReduce Programs, Shivnath Babu, Duke University, Durham, North Carolina, USA, shivnath@cs.duke.edu]
Research from Dominique A. Heger of DHT [Workload Dependent Hadoop MapReduce Application
Performance Modeling]
● The common result of most research initiatives is “One size does not fit all”
Example for classic MapReduce: there is not a single behavior; you have to know your workload characterization
“Hortonworks recommends that you either use the Balanced workload configuration or invest in a
pilot Hadoop cluster and plan to evolve as you analyze the workload patterns in your environment”
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.7/bk_cluster-planning-guide/content/typical-workloads.html
How are other configuration opportunities being pursued?
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 21
22. ● You want to know the limiting factor of each workload
● Examples are
CPU performance
Disk I/O
Memory (bandwidth and latency)
Network (bandwidth, delay, packet loss)
Storage space
● This is nothing new for the wise Capacity Planner! (a minimal node-snapshot sketch follows below)
Profiling your workload
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 22
Courtesy of Intel
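A minimal node-snapshot sketch, assuming the third-party psutil package: it samples the candidate limiting factors listed above and applies a deliberately naive classification. The thresholds are illustrative, not recommendations.

    import psutil  # third-party package: pip install psutil

    # One sample of the candidate limiting factors listed above.
    def snapshot():
        return {
            "cpu%": psutil.cpu_percent(interval=1),      # CPU busy over 1 s
            "mem%": psutil.virtual_memory().percent,     # RAM used
            "swap%": psutil.swap_memory().percent,       # swap used
            "disk%": psutil.disk_usage("/").percent,     # space used on /
            "read_ops": psutil.disk_io_counters().read_count,
            "write_ops": psutil.disk_io_counters().write_count,
            "net_sent": psutil.net_io_counters().bytes_sent,
            "net_recv": psutil.net_io_counters().bytes_recv,
        }

    # Deliberately naive classification; thresholds are illustrative.
    def limiting_factor(s):
        if s["cpu%"] > 85:
            return "CPU performance"
        if s["mem%"] > 90 or s["swap%"] > 5:
            return "Memory"
        if s["disk%"] > 80:
            return "Storage space"
        return "no obvious saturation; trend the IO/network counters over time"

    s = snapshot()
    print(s, "->", limiting_factor(s))

Disk and network counters are cumulative, so in practice you would sample twice and difference them to get rates, then watch the rates over time.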
23. 23
Agenda
● Why do we as Capacity Planners need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
24. Different points of view for analysis
• Interest in fast response for
“interactive workload”
– CPU, Memory, Network and IO utilization
levels to respond to queries in a quick and
effective way
• Interest in high throughput for
“batch workloads”
– Maximize the utilization levels, not interested
in response time
• Interest in storage capacity
– Understand and plan file system and HDFS
Different types of Workload
• Most companies are simply using Hadoop to store information (HDFS) for big data sets
• Vendors incorporate many other components: HDFS, Hive, Spark, Solr, Flume, etc.
• For example, there are significant differences
in Hadoop and HBase workloads
– Hadoop MapReduce is a framework to process large sets of data, using distributed and parallel algorithms
– HBase is much better for real-time
read/write/modify access to tabular data
Hadoop is a “zoo” of several different applications
24 Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
25. For each component
let’s make a summary of
• how they work so that we can focus on the
type of workload
• what the bottlenecks could be, in the order
we usually find them
• which technique, (a), (b), or (c), could apply
• what similar ‘traditional’ technology could be
used as analogy
3 standard types of analyses
We’ll check what’s underneath each component
to file them under three simple analyses we are all friends with:
a. interactive workload > you are interested in
a good response time
b. batch workloads > you are interested in
maximizing utilization, optimal concurrency
and best volume/duration ratio
c. storage > used/free space
Get your feet wet!
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 25
26. Online vs streaming vs batch – frame the problem in terms you already know
26 Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
http://www.hadoop360.com/blog/batch-vs-real-time-data-processing
27. How to get started
• HDFS is a write-once, read-many (WORM-ish) filesystem: you can only append to a file
– it keeps growing and growing!
• NameNode
– Monitor the disk space available to the NameNode (local or remote when diversified storage is used for resilience, as recommended); a JMX polling sketch follows below
• DataNode
– IO is important
– disk space is another dimension
What it is
• where the Hadoop cluster stores data; its functions include
– storing file metadata, overseeing the health of DataNodes, and coordinating access to data
• 2 main components
– NameNode: the master of HDFS; memory and I/O intensive
– DataNode: manages storage attached to the nodes
HDFS is an append-only file system; it does not allow data modification
HDFS Hadoop distributed filesystem
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 27
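A hedged monitoring sketch for the NameNode bullets above: every Hadoop 2.x daemon exposes a /jmx servlet returning JSON, and the NameNode's FSNamesystemState bean carries capacity figures. The host is illustrative, 50070 is the usual default HTTP port, and bean/attribute names can vary slightly across Hadoop versions.

    import requests

    NAMENODE = "http://namenode.example.com:50070"  # illustrative host, default port

    # The JMX servlet returns {"beans": [...]}; query one bean by name.
    bean = requests.get(
        NAMENODE + "/jmx",
        params={"qry": "Hadoop:service=NameNode,name=FSNamesystemState"},
    ).json()["beans"][0]

    total = bean["CapacityTotal"]          # bytes
    used = bean["CapacityUsed"]
    remaining = bean["CapacityRemaining"]
    print("HDFS used: %.1f%%" % (100.0 * used / total))
    print("remaining: %.1f TiB" % (remaining / 2.0**40))
    print("live DataNodes:", bean["NumLiveDataNodes"])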
28. How to get started/2
• Capacity analysis approach:
– (a), (b), or (c)
• Similar technology
– high level, manage it as any logical storage
device
Bottleneck
• Disk IO (volume of IOps and response time)
• Network bandwidth
• storage space [you need 4x the raw size of the data you will store in HDFS. However, on average we have seen compression ratios of 10-20x for the text files stored in HDFS, so the actual raw disk space required is only about 30-50% of the original uncompressed size; see the worked example below]
HDFS Hadoop distributed filesystem/2
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 28
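The storage rule of thumb above is easier to apply as a worked calculation. This small sketch just makes the slide's assumptions explicit: ~4x raw footprint (three replicas plus temporary/working space) and 10-20x text compression. The figures are the slide's rules of thumb, not universal constants.

    # Worked example of the sizing rule above; figures are the slide's
    # rules of thumb (4x footprint, 10-20x text compression), not constants.
    def hdfs_raw_disk_needed_tb(uncompressed_tb, compression_ratio, overhead=4.0):
        stored_tb = uncompressed_tb / compression_ratio  # data written to HDFS
        return stored_tb * overhead                      # replicas + working space

    for ratio in (10.0, 20.0):
        need = hdfs_raw_disk_needed_tb(100.0, ratio)
        print("100 TB of text, %.0fx compression -> %.0f TB raw disk (%.0f%% of original)"
              % (ratio, need, 100.0 * need / 100.0))
    # 10x compression -> 40 TB (40%); 20x -> 20 TB (20%)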
29. How to get started
• Bottleneck
– for every component
• Disk IO
• Network
– for node manager (slave)
• CPU
• Capacity analysis approach:
– (b)
• Similar technology
– Job Scheduler
What it is
• YARN is the architectural center of Hadoop
that allows multiple data processing engines
such as interactive SQL, real-time streaming,
data science and batch processing to handle
data stored in a single platform.
YARN Yet Another Resource Negotiator
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 29
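For the capacity analysis approach (b), here is a sketch against the ResourceManager REST API (Hadoop 2.x ships /ws/v1/cluster/metrics): it pulls the scheduler-level counters that matter for a batch/throughput analysis. The host is illustrative; 8088 is the usual default port.

    import requests

    RM = "http://resourcemanager.example.com:8088"  # illustrative host, default port

    m = requests.get(RM + "/ws/v1/cluster/metrics").json()["clusterMetrics"]

    # These fields map directly onto the YARN items in the metric
    # "laundry list" later in the deck.
    print("NodeManagers active/unhealthy/decommissioned:",
          m["activeNodes"], m["unhealthyNodes"], m["decommissionedNodes"])
    print("apps submitted/running/pending:",
          m["appsSubmitted"], m["appsRunning"], m["appsPending"])
    print("containers allocated/reserved/pending:",
          m["containersAllocated"], m["containersReserved"], m["containersPending"])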
30. How to get started
• Bottleneck
– JVM Memory metrics
– Very much workload dependent! You have to
profile your application
What it is
• Remember: it is a programming paradigm, not a standalone application. It mainly consists of two phases (a runnable sketch follows):
– In the Map phase, the main work is reading data blocks and splitting them into Map tasks for parallel processing. Intermediate results are stored temporarily in memory and spilled to disk
– The Reduce phase concentrates the output for the same key onto the same Reduce task, processes it, and writes the final result.
Map&Reduce
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 30
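A runnable word-count sketch in the Hadoop Streaming style, which makes the two phases tangible: the mapper emits (key, 1) pairs, the framework sorts/shuffles by key (spilling to memory and disk as described above), and the reducer sums contiguous keys. File and variable names are illustrative.

    # --- mapper side: the Map phase reads input line by line and emits
    #     intermediate (key, value) pairs, one per line, on stdout ---
    import sys

    def mapper(stream):
        for line in stream:
            for word in line.split():
                print(word + "\t1")

    # --- reducer side: the framework sorts/shuffles by key, so a reducer
    #     sees all values for a key contiguously and can just sum them ---
    def reducer(stream):
        current, count = None, 0
        for line in stream:
            key, value = line.rstrip("\n").split("\t")
            if key != current:
                if current is not None:
                    print(current + "\t" + str(count))
                current, count = key, 0
            count += int(value)
        if current is not None:
            print(current + "\t" + str(count))

    if __name__ == "__main__":
        # simulate the shuffle locally:
        #   python wordcount.py map < input.txt | sort | python wordcount.py reduce
        if sys.argv[1:] == ["map"]:
            mapper(sys.stdin)
        else:
            reducer(sys.stdin)

On a real cluster you would submit the two scripts through the Hadoop Streaming jar (roughly: hadoop jar .../hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <in> -output <out>); the exact jar path depends on your distribution.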
31. How to get started
• possible bottlenecks
– Memory
– Disk IO
– Network
• Capacity analysis approach
– (a) or (b)
• Similar technology
– data warehouse
What they are
• Hive provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3). Hive’s query language, HiveQL, compiles to MapReduce (see the sketch below)
• Pig is a high-level language for writing queries over large datasets. A query planner compiles queries written in this language (called “Pig Latin”) into maps and reduces, which are then executed on a Hadoop cluster. Pig’s main features are: ease of programming, optimization opportunities, customization, extensibility
Pig & Hive
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 31
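A hedged sketch of the interactive side of Hive, assuming the third-party PyHive package and a reachable HiveServer2 (10000 is the usual default port; host, table, and columns are invented for illustration). The point for the capacity planner: one short HiveQL statement fans out into one or more MapReduce jobs, so its resource profile is the batch profile of those jobs, not of the SQL text.

    from pyhive import hive  # third-party PyHive package, an assumption here

    # Connect to HiveServer2; host, table, and columns are illustrative.
    conn = hive.connect(host="hiveserver.example.com", port=10000,
                        username="capacity")
    cur = conn.cursor()

    # One HiveQL statement like this compiles into one or more MapReduce
    # jobs; its resource profile is the batch profile of those jobs.
    cur.execute(
        "SELECT page, COUNT(*) AS hits "
        "FROM web_logs "
        "GROUP BY page ORDER BY hits DESC LIMIT 10"
    )
    for page, hits in cur.fetchall():
        print(page, hits)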
32. How to get started
• possible bottlenecks
– Memory (be careful of swapping; watch JVM memory metrics and GC: pauses longer than 60 seconds can cause a RegionServer (RS) to go offline)
– Disk IO (in case data is spooled to disk)
– Network (latency)
• Capacity analysis approach
– (a)
• Similar technology
– distributed DBMS
What it is
• HBase is column-based rather than row-based, which enables high-speed execution of operations performed over similar values across massive data sets.
• HBase directly runs on top of HDFS
• It scales linearly by requiring all tables to have a
primary key. The key space is divided into
sequential blocks that are then allotted to a
region. RegionServers own one or more regions,
so the load is spread uniformly across the
cluster. HBase can further subdivide the region
by splitting it automatically, so that manual data
sharding is not necessary.
HBASE
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 32
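A small sketch of the HBase access pattern behind analysis (a), assuming the third-party happybase client and an HBase Thrift gateway; host, table, and row-key layout are illustrative.

    import happybase  # third-party client that talks to the HBase Thrift gateway

    connection = happybase.Connection("hbase-thrift.example.com")  # illustrative
    table = connection.table("metrics")  # hypothetical table

    # Row keys are the "primary key" the slide mentions; HBase keeps the
    # key space sorted and splits hot regions automatically.
    table.put(b"server42#2015-04-01T10:00",
              {b"cpu:user": b"37", b"cpu:sys": b"11"})

    # Point reads like this are the real-time workload HBase is built for,
    # hence the response-time-oriented analysis (a).
    print(table.row(b"server42#2015-04-01T10:00"))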
33. 33
Agenda
● Why do we as Capacity Planners need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
34. • CPU
Utilization (user/sys/wio)
load
• Memory
Utilization
used (cached, user, sys)
swap in/out
• disk IO
read/write ops rate
read/write ops byte rate
• network
sent/received packets and bits
• Garbage Collection
collections count and time
overhead (time percentage spent in GC), very
important
• Heap memory
– Size, used
– used after GC (much more valuable, you can
correlate it with workload)
– Perm Gen/Code Cache/Eden Space 'used'
– PS Old Gen / PS Perm Gen 'used'
– Tenured Gen 'used'
– PS Eden Space / PS Survivor Space 'used'
• JVM threads
Count
daemon count
• JVM files
JVM open/max open files
It all sounds good so far, but which metrics do I need?
Laundry list – generic metrics
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 34
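Most of the JVM items in this list can be scraped without an agent: Hadoop daemons expose their java.lang MBeans through the same /jmx servlet used for the Hadoop-specific beans. A hedged sketch follows; the DataNode host and its default 50075 HTTP port are assumptions, and pool names differ by collector.

    import requests

    # Any Hadoop daemon exposes its JVM MBeans via the same /jmx servlet;
    # the DataNode host and its default 50075 HTTP port are assumptions.
    DAEMON = "http://datanode1.example.com:50075"

    beans = requests.get(DAEMON + "/jmx").json()["beans"]

    for bean in beans:
        name = bean["name"]
        if name == "java.lang:type=Memory":
            heap = bean["HeapMemoryUsage"]
            print("heap used %.0f MiB of %.0f MiB"
                  % (heap["used"] / 2.0**20, heap["max"] / 2.0**20))
        elif name.startswith("java.lang:type=GarbageCollector"):
            # collection count and time let you derive the GC overhead
            # percentage called out as "very important" above
            print(name, "collections:", bean["CollectionCount"],
                  "total ms:", bean["CollectionTime"])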
35. • HDFS namenode:
storage: total and used capacity
Files created/total/deleted
• HDFS datanode
Fs: bytes read/written
Fs: reads/writes from local/remote client
“map reduce blocks”: volume of blocks read/written/removed/replicated/verified
“map reduce block operations”: copy/read/replace/write, avg time/volume
• YARN resource manager
active/decom/unhealthy NodeManagers
active applications/users
applications submitted, completed, failed,
killed
applications pending, running
containers
allocated/released/pending/reserved
• HBASE
– Request (total/read/write)
– memory stores size, upper limit
– flush queue length
– compaction queue length
• ZooKeeper
– sent/received packets
– request latency
– outstanding requests
– JVM pool size
• Solr
– request rate/latency
– JVM pool size
– added docs rate
– query result cache size, hits %, response time
– document cache size
It all sounds good so far, but which metrics do I need?
Laundry list – specific metrics
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 35
36. Headquarters
Via Schiaffino 11C
20158 Milan MI
Italy
T +39-024951-7001
USA East
One Boston Place, Floor 26
Boston, MA 02108
USA
T +1-617-936-0212
USA West
425 Broadway Street
Redwood City, CA 94063
USA
T +1-650-226-4274
moviri.com
37. ● Spark lets you quickly write applications in Java, Scala, or Python. It comes with a built-in set of over 80 high-level operators, and you can use it interactively to query data within the shell. It is a comprehensive, unified framework for managing big data processing requirements with a variety of data sets that are diverse in nature. In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning and graph data processing. Developers can use these capabilities stand-alone or combine them to run in a single data pipeline use case.
● Features
everything is in memory
data is stored in memory as a set of RDDs (Resilient Distributed Datasets)
best for cyclic (iterative) jobs: performance up to 100x better than Hadoop MapReduce (see the sketch below)
● possible bottlenecks:
Memory
Network + Disk IO (remote/local files)
CPU
● Capacity analysis approach: (a) or (b) depending on the workload
● Similar technology: similar to Hadoop MapReduce generic case
Spark
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 37
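A minimal PySpark sketch of why cyclic jobs benefit: cache() keeps the RDD in memory, so every pass after the first skips HDFS. The master URL and input path are illustrative, and a Spark installation is assumed.

    from pyspark import SparkContext

    # Master URL and input path are illustrative; requires a Spark install.
    sc = SparkContext(master="local[2]", appName="capacity-demo")

    lines = sc.textFile("hdfs:///user/capacity/web_logs")

    # cache() pins the RDD in memory: this is what makes iterative
    # ("cyclic") jobs fast, since later passes skip re-reading HDFS.
    words = lines.flatMap(lambda line: line.split()).cache()

    print("total words:", words.count())                 # first pass reads HDFS
    print("distinct words:", words.distinct().count())   # second pass hits memory

    sc.stop()

For the capacity planner this is the key difference from classic MapReduce: the working set lives in executor heap memory, so memory, not disk IO, is usually the first bottleneck to watch.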
38. ● Applications can leverage these services to coordinate distributed processing across large clusters. A
very large Hadoop cluster can be supported by multiple ZooKeeper servers.
● Each client machine communicates with one of the ZooKeeper servers to retrieve and update its
synchronization information. Often network and memory problems manifest themselves first in ZK
● possible bottlenecks:
CPU (wio)
Memory (JVM), latency
GC pauses longer than 60 seconds can cause a RegionServer to go offline
Network (latency)
● Capacity analysis approach: (a)
● Similar technology: in-memory database
Zookeeper
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 38
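ZooKeeper can be polled for the items in the metric laundry list with its built-in "four letter word" admin commands (mntr, available since ZooKeeper 3.4) over a plain socket; no client library is needed. The host is illustrative, 2181 is the default client port.

    import socket

    # "Four letter word" admin command over a raw socket (ZooKeeper 3.4+);
    # host is illustrative, 2181 is the default client port.
    def zk_mntr(host="zookeeper1.example.com", port=2181):
        s = socket.create_connection((host, port), timeout=5)
        try:
            s.sendall(b"mntr")
            data = b""
            while True:
                chunk = s.recv(4096)
                if not chunk:
                    break
                data += chunk
        finally:
            s.close()
        # output is one "key\tvalue" pair per line
        return dict(line.split("\t", 1)
                    for line in data.decode().splitlines() if "\t" in line)

    stats = zk_mntr()
    for key in ("zk_avg_latency", "zk_max_latency", "zk_outstanding_requests",
                "zk_packets_received", "zk_packets_sent"):
        print(key, stats.get(key))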
39. ● Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous
master-less replication allowing low latency operations for all clients
● possible bottlenecks:
Memory
Disk IO
Network
● Capacity analysis approach: (a) and (c)
● Similar technology: distributed DBMS
Cassandra
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 39
40. ● Solr is highly scalable: it provides distributed search and index replication. It is among the most popular enterprise search engines
● possible bottlenecks:
Memory (at the JVM level)
CPU
Disk IO
● Capacity analysis approach: (a)
● Similar technology: distributed DBMS
Solr
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 40
Editor’s notes
Abstract
Hadoop is a zoo of different types of workloads; even if most companies are simply using Hadoop to store information (HDFS), there are many other applications; to name a few: HDFS, Hive, Pig, Impala, Spark, Solr, Flume.
Each animal in this zoo behaves differently and, for example, there are significant differences in the two most common workloads “MapReduce” and “HBase”:
This leads to mainly three points of view for analysis to make sure service levels are achieved:
Interest in response time for “interactive workload”: CPU, Memory, Network and IO utilization levels to respond to queries in a quick and effective way
Interest in high throughput for “batch workloads”: Maximize the utilization levels, not interested in response time
Interest in planning storage capacity (filesystem and HDFS)
This talk focuses on providing guidelines for the capacity planner on how to translate existing techniques and frameworks and adapt them to these new technologies: in most cases, “what’s old is new again”.
Renato Bonomini, Lead of US operations for Moviri – engineer at heart and by training at Politecnico of Milan; my specialties are Digital Signal Processing and, in a previous life, High Performance Computing. Now I help companies achieve alignment between Business and IT using optimization techniques, Capacity Management and Performance Management
Mattia Berlusconi holds a degree in IT Engineering from the Politecnico of Milan. In 2012 he joined the Consulting Department of Moviri, working in the Capacity Management Business Unit as an IT Performance Optimization consultant. He participates in national and international projects focused on designing and implementing capacity management solutions that allow customers to effectively manage the capacity of their on-premises and cloud IT environments. He likes photography and hiking in the mountains.
Giulia Rumi is a member of the Moviri Capacity Management Team. Giulia holds a MS degree in Computer Engineering from Politecnico of Milan with a thesis work focused on energy consumption in mobile devices. She joined Moviri straight out of college in 2015. She plays piano and likes sweets and comics.
Cray 1: 80 MFLOPS
Neptuny/Moviri field trip at the computer museum in San Jose, in front of the Cray 1
IBM 350: 3.56 MB
Today:
iPhone 5s: GPU 76.8 GFLOPS, 16 GB of storage
Applications: Predictive Analytics, Machine Learning
A few applications or analytics that revolutionized the way we live
- Moneyball
- Recommendation engines
analytics: the “new machine revolution”
http://www.transhumanist.com/volume1/moravec.htm “When will computer hardware match the human brain?”
In 1964, Isaac Asimov, wrote about a visit to the World’s Fair of 2014:
“The world of A.D. 2014 will have few routine jobs that cannot be done better by some machine than by any human being. Mankind will therefore have become largely a race of machine tenders.” http://reuvengorsht.com/2015/02/07/machines-replace-middle-management/
Bad news for humankind: it has already happened; think of Uber and other companies delegating middle management to algorithms
http://reuvengorsht.com/2015/02/07/machines-replace-middle-management
Early example of Map Reduce process – or late example of divide et impera principle?
Compare Map and Reduce functions between the Caesar example and the WordCount example typical of M&R
Compare
MPI’s
Broadcast
Gather
Scatter
Reduce
Map & Reduce
Map
Reduce
A huge difference is how modern and enterprise-ready M&R is: ask a young developer to code in F77 (not the only choice, but a common one for MPI) or even in C, and compare this to the ease of development in, for example, Java
3 important points of why M&R made a difference that MPI could not popularize
the MapReduce principle http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf
what is hadoop in the latest version http://www.slideshare.net/cloudera/introduction-to-yarn-and-mapreduce-2
diagram to illustrate architecture ‘hadoop architecture’ [Giulia’s paper and http://wiki.apache.org/hadoop/PoweredByYarn ]
HDFS -> YARN -> {MR2, Impala, Spark, Hbase, MPI, hive, pig}
What we are focusing on: top list (HDFS, YARN, MapReduce, Hive/Pig, Hbase)
HDFS (Hadoop distributed filesystem) is where Hadoop cluster stores data
YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform
MapReduce is a programming paradigm
Hive provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3).
Pig A high-level data-flow language and execution framework for parallel computation.
Hbase is an open-source, distributed, versioned, column-oriented store that sits on top of HDFS.
other interesting : solr, spark, zookeeper, impala, cassandra
Spark Apache Spark is an open source big data real time processing framework built around speed, ease of use, and sophisticated analytics
ZooKeeper is an open source Apache project that provides a centralized infrastructure and services that enable synchronization across a cluster. It maintains data like configuration information, hierarchical naming space, and so on.
Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
Solr is an opensource enterprise search platform from the Apache Lucene project. It provides full-text search, hit highlighting, faceted search, dynamic clustering, database integration and rich document handling.
Why do you need to get on board soon?
Conversation heard at a company: “we’ll start the new hadoop cluster with 500 TB and then we’ll see how much we need”; there are significant amounts of resources ($$$) involved in these infrastructures
As a capacity planner, don’t miss the boat!
Role of the Capacity Planner and Performance Analyst
Shouldn’t the ‘hadoop user/owner’ take care of this? Distributed machine learning is still an active research topic; it is related to both machine learning and systems
While hadoop users don’t develop systems, they need to know how to choose them. An important fact is that existing distributed systems and parallel frameworks are not specifically designed for machine learning algorithms; hadoop users can
help to affect how systems are designed
design new algorithms for existing systems
what’s available as guidelines (http://blog.cloudera.com/blog/2010/08/hadoophbase-capacity-planning
and http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-hadoop-cluster )
starting point planner: http://hortonworks.com/cluster-sizing-guide/
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.7/bk_cluster-planning-guide/content/ch_hardware-recommendations.html
CURRENT PERFORMANCE ISSUES - research being developed, see
Optimization Techniques within the Hadoop Eco-system: A Survey
DOI: 10.1109/SYNASC.2014.65
scheduling performances: scheduling is one of the most important tasks in a multi-concurrent-task system : Paper from our colleague Giulia (and others) on “Optimization Techniques within the Hadoop Eco-system: a Survey”
this shows the typical optimization problems:
data locality
sticky slots problems
poor system utilization because of suboptimal distribution of tasks
unbalanced jobs
starvation and even fairness (be fair to your users)
others: pushing the envelope w/existing initiative:
starfish http://www.cs.duke.edu/starfish/index.html [Towards Automatic Optimization of MapReduce Programs, Shivnath Babu Duke University Durham, North Carolina, USA shivnath@cs.duke.edu]
documents from DHT [Workload Dependent Hadoop MapReduce Application Performance Modeling] Dominique A. Heger
www.cmg.org/wp-content/uploads/2013/07/m_101_61.pdf
“One size does not fit all”: for classic MapReduce there is not a single behaviour
you have to know your workload characterization
“Hortonworks recommends that you either use the Balanced workload configuration or invest in a pilot Hadoop cluster and plan to evolve as you analyze the workload patterns in your environment.”
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.7/bk_cluster-planning-guide/content/typical-workloads.html
3 standard types of analyses, we’ll check what’s underneath each component to file them under 3 simple analysis we are all friends with:
(a) interactive workload > you are interested in a good response time
(b) batch workloads > you are interested in maximizing utilization, optimal concurrency and best volume/duration ratio
(c) storage > used/free space
for each component, let’s try to make a summary of
how they work so that we can focus on the type of workload
what the bottlenecks could be, in the order we usually find them
what technique (a) (b) or (c) could apply
what similar ‘traditional’ technology could be used as analogy
2 main components
NameNode
it is the master of HDFS that directs the DataNode daemons to perform the low-level I/O tasks
it was a single point of failure for HDFS in MR1. it is no longer a single point of failure from v2: MapR has developed a "distributed NameNode," where the HDFS metadata is distributed across the cluster in "Containers,"
it maps the blocks onto the datanodes
The function of the NameNode is memory and I/O intensive.
Memory hungry!
Monitor JVM heap size
Monitor the disk space available to the NameNode (local or remote when diversified storage is used for resilience as recommended)
datanode
usually there is a datanode for each node in the cluster
it manages storage attached to the nodes
IO is important
disk space another dimension
HDFS is append-only file system; it does not allow data modification
HDFS is a write once, read many (or WORM-ish) filesystem: once a file is created, the filesystem API only allows you to append to the file, not to overwrite it. >> it keeps growing and growing!
possible bottlenecks:
Disk IO (volume of IOps and response time)
Network bandwidth
Storage
Capacity analysis approach: (a),(b) or (c)
Similar technology: high level, manage it as any logical storage device
YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform. YARN also extends the power of Hadoop to incumbent and new technologies found within the data center so that they can take advantage of cost effective, linear-scale storage and processing.
YARN’s original purpose was to split up the two major responsibilities of the JobTracker/TaskTracker into separate entities:
a global ResourceManager
a per-application ApplicationMaster
a per-node slave NodeManager
a per-application Container running on a NodeManager
The ResourceManager and the NodeManager formed the new generic system for managing applications in a distributed manner. The ResourceManager is the ultimate authority that arbitrates resources among all applications in the system. The ApplicationMaster is a framework-specific entity that negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the component tasks. The ResourceManager has a scheduler, which is responsible for allocating resources to the various applications running in the cluster, according to constraints such as queue capacities and user limits. The scheduler schedules based on the resource requirements of each application. Each ApplicationMaster has responsibility for negotiating appropriate resource containers from the scheduler, tracking their status, and monitoring their progress. From the system perspective, the ApplicationMaster runs as a normal container. The NodeManager is the per-machine slave, which is responsible for launching the applications’ containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager. NodeManager and DataNode run together onto the same machine
Hive provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3). Hive's query language, HiveQL, compiles to MapReduce. It also allows user-defined functions (UDFs). Hive is widely used, and has itself become a "sub-platform" in the Hadoop ecosystem. It is best suited for batch jobs over large sets of append-only data; Apache Hive automatically manages the compilation, optimization, and execution of a HiveQL statement.
Pig is a high-level language for writing queries over large datasets. A query planner compiles queries written in this language (called "Pig Latin") into maps and reduces which are then executed on a Hadoop cluster. Pig's main features are: ease of programming, optimization opportunities, customization, and extensibility. It abstracts the procedural style of MapReduce in the direction of the declarative style of SQL. A Pig program generally goes through three steps: load, transform, and store. First, the data on which the program has to work is loaded (in Hadoop the objects are stored in HDFS); then a set of transformations is applied to the loaded data, and the mappers and reducers are handled transparently to the user; finally, if needed, the results are stored in a local file or in HDFS.
possible bottlenecks:
Memory
Disk IO
Network
Capacity analysis approach: (a) or (b)
Similar technology: data warehouse
HBase is column-based rather than row-based, which enables high-speed execution of operations performed over similar values across massive data sets, e.g. read/write operations that involve all rows but only a small subset of all columns. HBase directly runs on top of HDFS
HBase scales linearly by requiring all tables to have a primary key. The key space is divided into sequential blocks that are then allotted to a region. RegionServers own one or more regions, so the load is spread uniformly across the cluster. If the keys within a region are frequently accessed, HBase can further subdivide the region by splitting it automatically, so that manual data sharding is not necessary.
possible bottlenecks:
Memory (be careful of swapping) - It is recommended to discourage swapping on HBase nodes and to enable GC logging to look for large GC pauses in the log. GC pauses longer than 60 seconds can cause RS to go offline
Disk IO (in case data is spooled to disk)
Network (latency)
Capacity analysis approach: (a)
Similar technology: distributed DBMS
Spark lets you quickly write applications in Java, Scala, or Python. It comes with a built-in set of over 80 high-level operators, and you can use it interactively to query data within the shell. It is a comprehensive, unified framework for managing big data processing requirements with a variety of data sets that are diverse in nature. In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning and graph data processing. Developers can use these capabilities stand-alone or combine them to run in a single data pipeline use case.
everything is in memory
data is stored in memory as a set of RDDs
best for cyclic jobs
best performance with cyclic jobs (up to 100x better than Hadoop MapReduce)
possible bottlenecks:
Memory
Network + Disk IO (remote/local files)
CPU
Capacity analysis approach: (a) or (b) depending on the workload
Similar technology: similar to Hadoop MapReduce generic case