moviri.com
Hitchhiker’s guide for the Capacity Planner
Connecticut Computer Measurement Group
Cromwell CT – April 2015
Renato Bonomini renato.bonomini@moviri.com
Capacity Management and BigData
2
Agenda
● Why do we as Capacity Planners need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
Brought to you by…
Renato Bonomini
Lead of US operations
for Moviri
@renatobonomini
Mattia Berlusconi
Capacity
Management
Consultant
Giulia Rumi
Capacity
Management
Analyst
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 3
4
Agenda
● Why do we as Capacity Planners need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
Handling large amounts of data? High Performance Computing?
Is it new? Where does it come from? Why do I have to listen to this?
5
Cray 1, 80 MFLOPS, 1975
[A bunch of engineers on a field trip in Silicon Valley, Renato]
IBM 350, 3.75 MB, 1956
[Wikipedia]
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
“When will computer
hardware match the
human brain?”
Hans Moravec
Robotics Institute
Carnegie Mellon
University
The need for
Analytics:
the new
“machine
revolution”
6
http://www.transhumanist.com/volume1/moravec.htm
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
1964: Isaac Asimov on the 2014 World’s Fair
February 2015
“The world of A.D. 2014 will have
few routine jobs that cannot be
done better by some machine than
by any human being.
Mankind will therefore have
become largely a race of machine
tenders.”
“When will computer hardware match the human brain?”
7
http://reuvengorsht.com/2015/02/07/machines-replace-middle-management
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
“Divide et Impera” vs. “Map and Reduce”
• Julius Caesar arrives in Alexandria after
defeating the Egyptian army and enters the
Ancient Library
• Surprise: there are millions of volumes in the
library; how many of them are in Latin?
• Caesar assigns a Centuria (80 soldiers) to the
task: each soldier inspects a batch of books and
reports to his Centurion the number of pages
written in Latin in each book
• The Centurion writes the count from each
soldier on a tabula; when they are finished, he
sums the parts up (a Python sketch of this pattern follows this slide)
All I need to know I learned from Rome
8
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
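To make the analogy concrete, here is a minimal sketch of the two phases in plain Python. It runs standalone, not on Hadoop, and the toy library and language tags are invented for illustration:

```python
from collections import defaultdict

# Toy corpus: each "book" is a list of (language, page_text) pairs. Invented data.
library = [
    [("latin", "arma virumque cano"), ("greek", "menin aeide thea")],
    [("latin", "gallia est omnis divisa"), ("latin", "in partes tres")],
]

def map_phase(book):
    # Each "soldier" inspects one book and emits a (key, 1) pair per Latin page.
    return [("latin_pages", 1) for language, _page in book if language == "latin"]

def reduce_phase(pairs):
    # The "Centurion" sums the partial counts for each key.
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

mapped = [pair for book in library for pair in map_phase(book)]
print(reduce_phase(mapped))  # {'latin_pages': 3}
```

Hadoop distributes the map phase across the nodes holding the data and runs the reduce after a shuffle; the shape of the computation stays the same.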
Message Passing Interface vs. “Map and Reduce”
So “Map and Reduce” was a revolution? In what sense?
9
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
MPI tutorial Blaise Barney
Lawrence Livermore National Laboratory
MPI: C, Fortran
MapReduce: Java, Python
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
1. MapReduce makes these technologies available to a wide audience
 We saw that MPI already handled similar use cases, but it was mostly restricted to university
research and large R&D facilities
2. Reliability on top of commodity hardware is at its base
3. It moves the needle on how to handle large amounts of data (see the sketch below)
 Database: organize first, then load
 Hadoop: load first, then organize
What are the revolutions brought by MapReduce and BigData?
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 10
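A hedged illustration of point 3 in plain Python, with an invented log format: the database style parses at load time, while the Hadoop style stores raw bytes and parses at query time:

```python
raw_lines = ["2015-04-01 GET /index.html 200", "2015-04-01 POST /login 302"]

# "Organize first, then load" (database style): enforce a schema up front.
rows = []
for line in raw_lines:
    date, verb, path, status = line.split()      # fails fast on bad records
    rows.append({"date": date, "verb": verb, "path": path, "status": int(status)})

# "Load first, then organize" (Hadoop style): store the raw lines untouched,
# and apply whatever structure each query needs at read time.
stored = list(raw_lines)                         # load: no parsing, no schema
hits_per_verb = {}
for line in stored:                              # organize: parse in the "query"
    verb = line.split()[1]
    hits_per_verb[verb] = hits_per_verb.get(verb, 0) + 1
print(hits_per_verb)                             # {'GET': 1, 'POST': 1}
```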
11
Agenda
● Why do we as Capacity Planners need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
● “Hardware” contains libraries and utilities, stores data, and supports
job execution
● HDFS is the fault-tolerant, replicated distributed file system
● YARN (Yet Another Resource Negotiator) supports several
programming models that can co-exist in the cluster; MapReduce
is only one of them
● The Application layer is composed of several frameworks, among
which Pig and Hive are the most used.
Hadoop write workflow (a toy sketch follows this slide)
● clients break data into small chunks to be loaded onto different data
nodes
● for each data block, the client contacts the namenode, which answers
with a sorted list of 3 data nodes (every block is replicated on more
than one machine)
● the client writes each block directly to the first datanode, which
replicates the data to the other two nodes
The most famous open-source implementation of a MapReduce
framework is Apache Hadoop
12
Optimization Techniques within the Hadoop Eco-system: a Survey
Giulia Rumi, Claudia Colella, Danilo Ardagna
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
LAYERS              | HADOOP 1.X | HADOOP 2.X
Users               |            |
Application layer   | Hive/Pig   | Hive/Pig
Programming Models  | MapReduce  | MapReduce
Resource Management | MapReduce  | YARN
File system         | HDFS       | HDFS
Hardware            |            |
(In Hadoop 1.X a single MapReduce layer covers both the programming model and
resource management; Hadoop 2.X splits resource management out into YARN.)
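As a mental model of that write path, a simplified sketch follows; every class and method in it (NameNode, DataNode, choose_datanodes, the replication pipeline) is invented for illustration and is not the real HDFS client API:

```python
BLOCK_SIZE = 8  # toy block size so the demo stays small; real HDFS uses 64/128 MB

class NameNode:
    """Hypothetical namenode: tracks metadata only, never touches block data."""
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.block_map = {}                 # block id -> datanodes holding it

    def choose_datanodes(self, block_id, replication=3):
        # Real HDFS sorts candidates by rack awareness; we just take the first 3.
        chosen = self.datanodes[:replication]
        self.block_map[block_id] = chosen
        return chosen

class DataNode:
    """Hypothetical datanode: stores blocks and forwards them down a pipeline."""
    def __init__(self, name):
        self.name, self.blocks = name, {}

    def write(self, block_id, data, pipeline):
        self.blocks[block_id] = data        # store locally first
        if pipeline:                        # then replicate to the next node
            pipeline[0].write(block_id, data, pipeline[1:])

def client_put(namenode, data):
    # The client splits the file into blocks and writes each block itself;
    # only metadata flows through the namenode.
    for offset in range(0, len(data), BLOCK_SIZE):
        block_id, chunk = "blk_%d" % offset, data[offset:offset + BLOCK_SIZE]
        first, *rest = namenode.choose_datanodes(block_id)
        first.write(block_id, chunk, rest)

datanodes = [DataNode("dn%d" % i) for i in range(4)]
client_put(NameNode(datanodes), b"hello hdfs write path!")  # 3 blocks, 3 copies each
```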
Apache Hadoop Ecosystem
13
http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview
Apache Hadoop documentation
Between the Apache Hadoop
Ecosystem and the NOSQL world,
new applications are being
developed every day
https://hadoopecosystemtable.github.io
http://nosql-database.org/select-the-right-database.html
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
Geek Fun
“A DBA walks into a NOSQL bar,
but turns and leaves
because he couldn't find a table”
(webtonull)
● HDFS (Hadoop distributed filesystem) is
where a Hadoop cluster stores data
● YARN is the architectural center of Hadoop
that allows multiple data processing engines
● MapReduce is a programming paradigm
● Hive provides a warehouse structure and
SQL-like access for data in HDFS
● Pig is a high-level data-flow language
● HBase is an open-source, distributed,
versioned, column-oriented store that sits
on top of HDFS.
• Apache Spark is an open-source big data
real-time processing framework
• ZooKeeper is an open source Apache project
that provides a centralized infrastructure and
services that enable synchronization across a
cluster
• Apache Cassandra is an open source
distributed database management system
designed to handle large amounts of data
across many commodity servers
• Solr is an open-source enterprise search
platform from the Apache Lucene project. It
provides full-text search, hit highlighting,
faceted search, dynamic clustering, database
integration and rich document handling.
We are going to focus on a few specific “animals” of this zoo
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 15
16
Agenda
● Why do we as Capacity Planners need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
“we’ll start the new Hadoop cluster with 500 TB and then we’ll see how much we need”
Real conversation with a customer
Why do you need to get on board soon?
There are significant resources at stake and large areas of improvement
● Significant investments are being directed towards these
initiatives
● They are complex and large, with hundreds of configuration
parameters: a little help from an experienced capacity
planner can save a lot of money
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 17
● Shouldn’t the ‘Hadoop user/owner’ take care of this?
 Distributed machine learning is still an active research topic; it is related to both
machine learning and systems
 While Hadoop users don’t develop systems, they need to know how to choose
them. An important fact is that existing distributed systems and parallel
frameworks are not specifically designed for machine learning algorithms
● Hadoop users can
 help shape how systems are designed
 design new algorithms for existing systems
Role of the Capacity Planner and Performance Analyst
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 18
● http://blog.cloudera.com/blog/2010/08/hadoophbase-capacity-planning
● http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-hadoop-cluster
● http://hortonworks.com/cluster-sizing-guide
● http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.7/bk_cluster-planning-guide/content/ch_hardware-recommendations.html
Vendor guidelines – if you are in a hurry you can stop here
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 19
● Scheduling is one of the most important tasks in a multi-
tenant, concurrent-task system: see the research from our
colleague Giulia (and others), “Optimization Techniques within
the Hadoop Eco-system: a Survey” [DOI: 10.1109/SYNASC.2014.65]
● It illustrates the typical optimization problems:
 data locality (see the toy scheduler sketch after this slide)
 sticky slots
 poor system utilization because of suboptimal distribution
of tasks
 unbalanced jobs
 starvation and fairness (be fair to your users)
● There are hundreds of configuration variables available
to the end user: rule of thumb vs. optimal configuration
can make a big difference
Current performance tuning opportunities:
Scheduling
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 20
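To see why data locality matters to a scheduler, here is a hedged toy sketch (invented data structures, not the actual Hadoop scheduler): tasks prefer the nodes that already hold their input block, so the scheduler tries data-local placements before falling back to any free slot:

```python
def schedule(tasks, free_slots):
    """tasks: list of (task_id, preferred_nodes); free_slots: set of node names.
    Returns {task_id: node}, preferring data-local placements."""
    assignment, deferred = {}, []
    for task_id, preferred in tasks:              # pass 1: data-local only
        local = next((n for n in preferred if n in free_slots), None)
        if local is not None:
            assignment[task_id] = local
            free_slots.remove(local)
        else:
            deferred.append(task_id)
    for task_id in deferred:                      # pass 2: any free slot (remote read)
        if free_slots:
            assignment[task_id] = free_slots.pop()
    return assignment

tasks = [("t1", ["dn1"]), ("t2", ["dn1"]), ("t3", ["dn3"])]
print(schedule(tasks, {"dn1", "dn2", "dn3"}))
# t1 -> dn1 (local), t3 -> dn3 (local), t2 -> dn2 (non-local fallback)
```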
● Other initiatives
 Starfish, http://www.cs.duke.edu/starfish/index.html [Towards Automatic Optimization of
MapReduce Programs, Shivnath Babu, Duke University, Durham, North Carolina, USA,
shivnath@cs.duke.edu]
 Research from Dominique A. Heger of DHT [Workload Dependent Hadoop MapReduce Application
Performance Modeling]
● The common result of most research initiatives is “one size does not fit all”
 Example for classic MapReduce: there is no single behavior; you have to know your workload
characterization
 “Hortonworks recommends that you either use the Balanced workload configuration or invest in a
pilot Hadoop cluster and plan to evolve as you analyze the workload patterns in your environment”
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.7/bk_cluster-planning-guide/content/typical-workloads.html
How are other configuration opportunities being pursued?
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 21
● You want to know the limiting factor of
each workload (a small sketch follows this slide)
● Examples are
 CPU performance
 Disk I/O
 Memory (bandwidth and latency)
 Network (bandwidth, delay, packet loss)
 storage space
● This is nothing new for the
wise Capacity Planner!
Profiling your workload
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 22
Courtesy of Intel
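A hedged sketch of the idea with made-up utilization samples: for each candidate resource, take a high percentile of utilization and flag the one closest to saturation as the likely limiting factor:

```python
def limiting_factor(samples, percentile=0.95):
    """samples: {resource: [utilization values 0..1]}.
    Returns (resource, p95) for the resource closest to saturation."""
    def pct(values, p):
        s = sorted(values)
        return s[min(int(p * len(s)), len(s) - 1)]
    scored = {r: pct(v, percentile) for r, v in samples.items()}
    return max(scored.items(), key=lambda kv: kv[1])

workload = {                       # invented measurements
    "cpu":      [0.45, 0.52, 0.48, 0.55],
    "disk_io":  [0.80, 0.92, 0.95, 0.88],
    "network":  [0.30, 0.35, 0.33, 0.31],
}
print(limiting_factor(workload))   # ('disk_io', 0.95)
```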
23
Agenda
● Why do we as Capacity Planners need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
Different points of view for analysis
• Interest in fast response for
“interactive workload”
– CPU, Memory, Network and IO utilization
levels to respond to queries in a quick and
effective way
• Interest in high throughput for
“batch workloads”
– Maximize the utilization levels, not interested
in response time
• Interest in storage capacity
– Understand and plan file system and HDFS
Different types of Workload
• Most companies are simply using Hadoop to
store large data sets (HDFS)
• Vendors incorporate many other
components: HDFS, Hive, Spark, Solr, Flume,
etc.
• For example, there are significant differences
between Hadoop and HBase workloads
– Hadoop MapReduce is a framework for
processing large sets of data, using distributed
and parallel algorithms
– HBase is much better for real-time
read/write/modify access to tabular data
Hadoop is a “zoo” of several different applications
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 24
For each component
let’s make a summary of
• how it works, so that we can focus on the
type of workload
• what the bottlenecks could be, in the order
we usually find them
• which technique (a), (b) or (c) applies
• which similar ‘traditional’ technology can be
used as an analogy
3 standard types of analyses
We’ll check what’s underneath each component
to file it under 3 simple analyses we are all
friends with:
a. interactive workloads > you are interested in
good response time
b. batch workloads > you are interested in
maximizing utilization, optimal concurrency
and the best volume/duration ratio
c. storage > used/free space
Get your feet wet!
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 25
Online vs streaming vs batch – frame the problem as you already know
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 26
http://www.hadoop360.com/blog/batch-vs-real-time-data-processing
How to get started
• HDFS is a write-once, read-many (or WORM-
ish) filesystem: you can only append to a file
– it keeps growing and growing!
• NameNode
– Monitor the disk space available to the
NameNode (local, or remote when diversified
storage is used for resilience, as
recommended); a check sketch follows this slide
• DataNode
– IO is important
– disk space is another dimension
What it is
• where a Hadoop cluster stores data; functions
include
– storage of file metadata, overseeing the
health of datanodes, coordinating access
to data
• 2 main components
– NameNode: the master of HDFS, memory-
and I/O-intensive
– DataNode: manages the storage attached to
the nodes
HDFS is an append-only file system; it does not
allow data modification
HDFS Hadoop distributed filesystem
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 27
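A minimal sketch of that NameNode disk-space check using only the Python standard library; the directory below is a placeholder, so point it at the volume holding your dfs.namenode.name.dir:

```python
import os

def free_space_pct(path):
    """Percentage of free space on the filesystem holding `path` (Unix only)."""
    st = os.statvfs(path)
    return 100.0 * st.f_bavail / st.f_blocks

meta_dir = "/hadoop/dfs/name"          # placeholder metadata directory
if free_space_pct(meta_dir) < 20:
    print("warning: NameNode volume %s below 20%% free" % meta_dir)
```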
How to get started/2
• Capacity analysis approach:
– (a), (b) or (c)
• Similar technology
– at a high level, manage it as any logical storage
device
Bottleneck
• Disk IO (volume of IOPS and response time)
• Network bandwidth
• storage space [as a rule of thumb you need
about 4x the raw size of the data you will store
in HDFS (3x replication plus working space).
However, on average we have seen compression
ratios of up to 10-20x for text files stored in
HDFS, so the actual raw disk space required can
be only about 30-50% of the original
uncompressed size; a sizing sketch follows this slide]
HDFS Hadoop distributed filesystem/2
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 28
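A back-of-the-envelope sizing sketch that encodes the rule of thumb above; the 4x factor (3x replication plus working space) and the compression ratio are this slide's assumptions, not universal constants:

```python
def raw_disk_needed_tb(logical_tb, replication=3, working_space=1.0,
                       compression_ratio=1.0):
    """Raw disk (TB) needed for `logical_tb` of uncompressed data.
    replication + working_space reproduces the "4x" rule of thumb;
    compression_ratio > 1 shrinks what actually lands on disk."""
    on_disk = logical_tb / compression_ratio
    return on_disk * (replication + working_space)

# 100 TB of raw text with ~10x compression: 40 TB of raw disk,
# i.e. about 40% of the original uncompressed size.
print(raw_disk_needed_tb(100, compression_ratio=10.0))   # 40.0
```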
How to get started
• Bottleneck
– for every component
• Disk IO
• Network
– for node manager (slave)
• CPU
• Capacity analysis approach:
– (b)
• Similar technology
– Job Scheduler
What it is
• YARN is the architectural center of Hadoop:
it allows multiple data processing engines,
such as interactive SQL, real-time streaming,
data science and batch processing, to handle
data stored in a single platform.
YARN Yet Another Resource Negotiator
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 29
How to get started
• Bottleneck
– JVM Memory metrics
– Very much workload dependent! You have to
profile your application
What it is
• Remember: it is a programming paradigm,
not a standalone application. It mainly
consists of two phases:
– In the Map phase, the main work is reading
data blocks and splitting them into Map tasks
processed in parallel. The results are temporarily
stored in memory and on disk
– In the Reduce phase, the output for each key is
concentrated into the same Reduce task and
processed, and the final result is written out.
Map&Reduce
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 30
How to get started
• possible bottlenecks
– Memory
– Disk IO
– Network
• Capacity analysis approach
– (a) or (b)
• Similar technology
– data warehouse
What they are
• Hive provides a warehouse structure and
SQL-like access for data in HDFS and other
Hadoop input sources (e.g. Amazon S3). Hive’s
query language, HiveQL, compiles to
MapReduce (a sketch of that compilation follows this slide)
• Pig is a high-level language for writing
queries over large datasets. A query planner
compiles queries written in this language
(called “Pig Latin”) into maps and reduces,
which are then executed on a Hadoop
cluster. Pig’s main features are ease of
programming, optimization opportunities,
customization and extensibility
Pig & Hive
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 31
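To connect Hive back to the Map and Reduce sketch earlier: a query such as SELECT status, COUNT(*) FROM logs GROUP BY status (HiveQL-style, quoted here only as pseudocode) compiles to exactly the shape below; a standalone Python illustration with invented rows:

```python
from collections import defaultdict

logs = [("GET", 200), ("GET", 404), ("POST", 200), ("GET", 200)]  # invented rows

# Map: emit (group-by key, 1) per row -- what the compiled mappers do.
mapped = [(status, 1) for _verb, status in logs]

# Shuffle + Reduce: group by key and aggregate -- what the reducers do.
counts = defaultdict(int)
for status, one in mapped:
    counts[status] += one

print(dict(counts))   # {200: 3, 404: 1}
```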
How to get started
• possible bottlenecks
– Memory (be careful of swapping, JVM
memory metrics and GC); GC pauses longer
than 60 seconds can cause a RegionServer (RS)
to go offline
– Disk IO (in case data is spooled to disk)
– Network (latency)
• Capacity analysis approach
– (a)
• Similar technology
– distributed DBMS
What it is
• HBase is column-based rather than row-based,
which enables high-speed execution of
operations performed over similar values across
massive data sets
• HBase runs directly on top of HDFS
• It scales linearly by requiring all tables to have a
primary key. The key space is divided into
sequential blocks that are then allotted to
regions. RegionServers own one or more regions,
so the load is spread uniformly across the
cluster. HBase can further subdivide a region
by splitting it automatically, so manual data
sharding is not necessary (a toy sketch follows this slide).
HBASE
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 32
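A hedged toy model of that key-space partitioning (bisect-based lookup over invented region boundaries; not the HBase client API):

```python
import bisect

# Invented split points: region i holds keys in [boundaries[i-1], boundaries[i]).
boundaries = ["g", "p"]                 # 3 regions: [..g), [g..p), [p..]
region_servers = ["rs1", "rs2", "rs1"]  # regions spread across RegionServers

def locate(row_key):
    """Map a row key to its region index and owning RegionServer."""
    region = bisect.bisect_right(boundaries, row_key)
    return region, region_servers[region]

for key in ("apple", "kiwi", "zebra"):
    print(key, "->", locate(key))
# apple -> region 0 on rs1, kiwi -> region 1 on rs2, zebra -> region 2 on rs1
```

When one region grows too hot or too large, adding a new split point (and moving one key range) rebalances the load without resharding the whole table, which is the property the slide describes.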
33
Agenda
● Why do we as Capacity Planners need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
• CPU
 Utilization (user/sys/wio)
 load
• Memory
 Utilization
 used (cached, user, sys)
 swap in/out
• disk IO
 read/write ops rate
 read/write ops byte rate
• network
 sent/received packets and bits
• Garbage Collection
 collections count and time
 overhead (time percentage spent in GC), very
important; a computation sketch follows this slide
• Heap memory
– Size, used
– used after GC (much more valuable, you can
correlate it with workload)
– Perm Gen/Code Cache/Eden Space 'used'
– PS Old Gen/PS Perm Gen 'used'
– Tenured Gen 'used'
– PS Eden/Survivor/PS Survivor Space 'used'
• JVM threads
 Count
 daemon count
• JVM files
 JVM open/max open files
It sounds all good so far but which metrics do I need?
Laundry list – generic metrics
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 34
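Since GC overhead is flagged above as very important, here is a hedged sketch of the computation from two samples of the JVM's cumulative collection-time counter (the numbers are invented; the counter itself is what the GarbageCollector MXBeans expose):

```python
def gc_overhead_pct(t0_ms, gc0_ms, t1_ms, gc1_ms):
    """Percentage of wall-clock time spent in GC between two samples.
    t*_ms: sample timestamps; gc*_ms: cumulative GC time counters."""
    return 100.0 * (gc1_ms - gc0_ms) / (t1_ms - t0_ms)

# Two samples 60 s apart; the JVM accumulated 1.8 s of extra GC time:
print(gc_overhead_pct(0, 5000, 60000, 6800))   # 3.0 (% of time spent in GC)
```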
• HDFS namenode (a /jmx polling sketch follows this slide):
 storage: total and used capacity
 Files created/total/deleted
• HDFS datanode
 Fs: bytes read/written
 Fs: reads/writes from local/remote client
 “map reduce blocks”: volume of read,
written/removed/replicated/verified
 “map reduce blocks operations”:
copy/read/replace/write, avg time/volume
• YARN resource manager
 active/decom/unhealthy NodeManagers
 active applications/users
 applications submitted, completed, failed,
killed
 applications pending, running
 containers
allocated/released/pending/reserved
• HBASE
– Request (total/read/write)
– memory stores size, upper limit
– flush queue length
– compaction queue length
• ZooKeeper
– sent/received packets
– request latency
– outstanding requests
– JVM pool size
• Solr
– request rate/latency
– JVM pool size
– added docs rate
– query result cache size, hits %, response time
– document cache size
It sounds all good so far but which metrics do I need?
Laundry list – specific metrics
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 35
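Many of the NameNode counters above are exposed over HTTP by the NameNode web UI (port 50070 on Hadoop 2.x, 9870 on Hadoop 3) at the /jmx endpoint; a hedged polling sketch with the standard library, where the host name is a placeholder for your cluster:

```python
import json
from urllib.request import urlopen

NAMENODE = "http://namenode.example.com:50070"   # placeholder host

def fsnamesystem_metrics():
    """Fetch capacity and file counters from the NameNode /jmx endpoint."""
    qry = "Hadoop:service=NameNode,name=FSNamesystem"
    with urlopen("%s/jmx?qry=%s" % (NAMENODE, qry)) as resp:
        bean = json.load(resp)["beans"][0]
    return {
        "capacity_total_bytes": bean["CapacityTotal"],
        "capacity_used_bytes": bean["CapacityUsed"],
        "files_total": bean["FilesTotal"],
        "blocks_total": bean["BlocksTotal"],
    }

print(fsnamesystem_metrics())
```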
Headquarters
Via Schiaffino 11C
20158 Milan MI
Italy
T +39-024951-7001
USA East
One Boston Place, Floor 26
Boston, MA 02108
USA
T +1-617-936-0212
USA West
425 Broadway Street
Redwood City, CA 94063
USA
T +1-650-226-4274
moviri.com
● Spark lets you quickly write applications in Java, Scala, or Python. It comes with a built-in set of over 80 high-level
operators, and you can use it interactively to query data within the shell. It is a comprehensive, unified framework
to manage big data processing requirements with a variety of data sets that are diverse in nature. In addition to
Map and Reduce operations, it supports SQL queries, streaming data, machine learning and graph data
processing. Developers can use these capabilities stand-alone or combine them in a single data pipeline use case.
● Features
 everything is in memory
 data is held in memory as RDDs (Resilient Distributed Datasets)
 best for cyclic/iterative jobs (performance reported up to 100x better than Hadoop MapReduce); see the
word-count sketch after this slide
● possible bottlenecks:
 Memory
 Network + Disk IO (remote/local files)
 CPU
● Capacity analysis approach: (a) or (b) depending on the workload
● Similar technology: similar to the Hadoop MapReduce generic case
Spark
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 37
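A minimal PySpark word count showing the in-memory style; the input path is a placeholder, and cache() is what makes repeated passes over the same data cheap for iterative jobs:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")

# cache() keeps the parsed RDD in memory so later actions reuse it cheaply.
words = (sc.textFile("hdfs:///data/sample.txt")   # placeholder input path
           .flatMap(lambda line: line.split())
           .cache())

counts = (words.map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))

for word, n in counts.take(10):
    print(word, n)
sc.stop()
```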
● Applications can leverage these services to coordinate distributed processing across large clusters. A
very large Hadoop cluster can be supported by multiple ZooKeeper servers.
● Each client machine communicates with one of the ZooKeeper servers to retrieve and update its
synchronization information. Network and memory problems often manifest themselves first in ZooKeeper
● possible bottlenecks:
 CPU wio
 Memory (JVM) latency; GC pauses longer than 60 seconds can cause a RegionServer to go offline
 Network (latency)
● Capacity analysis approach: (a)
● Similar technology: in-memory database
ZooKeeper
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 38
● Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous
master-less replication allowing low latency operations for all clients
● possible bottlenecks:
 Memory
 Disk IO
 Network
● Capacity analysis approach: (a) and (c)
● Similar technology: distributed DBMS
Cassandra
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 39
● Solr is highly scalable: it provides distributed search and index replication. It is one of the most popular
enterprise search engines
● possible bottlenecks:
 Memory (at the JVM level)
 CPU
 Disk IO
● Capacity analysis approach: (a)
● Similar technology: distributed DBMS
Solr
Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 40

Más contenido relacionado

La actualidad más candente

Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14John Sing
 
Introduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsIntroduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsFadi Yousuf
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceUwe Printz
 
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherJanBask Training
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalabilityWANdisco Plc
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezJan Pieter Posthuma
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopVigen Sahakyan
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce FrameworkEdureka!
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Senthil Kumar
 
Overview of stinger interactive query for hive
Overview of stinger   interactive query for hiveOverview of stinger   interactive query for hive
Overview of stinger interactive query for hiveDavid Kaiser
 
MapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR Technologies
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksDataWorks Summit
 
Interactive query in hadoop
Interactive query in hadoopInteractive query in hadoop
Interactive query in hadoopRommel Garcia
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Milind Bhandarkar
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Sudhir Mallem
 

La actualidad más candente (20)

Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Introduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsIntroduction to Hadoop - The Essentials
Introduction to Hadoop - The Essentials
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduce
 
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for Fresher
 
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview
 
Overview of stinger interactive query for hive
Overview of stinger   interactive query for hiveOverview of stinger   interactive query for hive
Overview of stinger interactive query for hive
 
MapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document DatabaseMapR-DB – The First In-Hadoop Document Database
MapR-DB – The First In-Hadoop Document Database
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
Interactive query in hadoop
Interactive query in hadoopInteractive query in hadoop
Interactive query in hadoop
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
 

Destacado

Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Mac Moore
 
Twitter, Big Data and Health
Twitter, Big Data and Health Twitter, Big Data and Health
Twitter, Big Data and Health Ardi Priasa
 
Cost savings and expert system advice with athene ES/1
Cost savings and expert system advice with athene ES/1 Cost savings and expert system advice with athene ES/1
Cost savings and expert system advice with athene ES/1 Metron
 
The Care + Feeding of a Mongodb Cluster
The Care + Feeding of a Mongodb ClusterThe Care + Feeding of a Mongodb Cluster
The Care + Feeding of a Mongodb ClusterChris Henry
 
Hive - Apache hadoop Bigdata training by Desing Pathshala
Hive - Apache hadoop Bigdata training by Desing PathshalaHive - Apache hadoop Bigdata training by Desing Pathshala
Hive - Apache hadoop Bigdata training by Desing PathshalaDesing Pathshala
 
Hadoop Basics - Apache hadoop Bigdata training by Design Pathshala
Hadoop Basics - Apache hadoop Bigdata training by Design Pathshala Hadoop Basics - Apache hadoop Bigdata training by Design Pathshala
Hadoop Basics - Apache hadoop Bigdata training by Design Pathshala Desing Pathshala
 
Bio bigdata
Bio bigdata Bio bigdata
Bio bigdata Mk Kim
 
Recommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQLRecommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQLDavid Gleich
 
Dynamic Allocation in Spark
Dynamic Allocation in SparkDynamic Allocation in Spark
Dynamic Allocation in SparkDatabricks
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingDatabricks
 
Bigdata Hadoop project payment gateway domain
Bigdata Hadoop project payment gateway domainBigdata Hadoop project payment gateway domain
Bigdata Hadoop project payment gateway domainKamal A
 
Ansible for Drupal infrastructure and deployments
Ansible for Drupal infrastructure and deploymentsAnsible for Drupal infrastructure and deployments
Ansible for Drupal infrastructure and deploymentsJeff Geerling
 
Dynamic Resource Allocation Spark on YARN
Dynamic Resource Allocation Spark on YARNDynamic Resource Allocation Spark on YARN
Dynamic Resource Allocation Spark on YARNTsuyoshi OZAWA
 
HBase Sizing Guide
HBase Sizing GuideHBase Sizing Guide
HBase Sizing Guidelarsgeorge
 
Monitoring 改造計畫:流程觀點
Monitoring 改造計畫:流程觀點Monitoring 改造計畫:流程觀點
Monitoring 改造計畫:流程觀點William Yeh
 
瓶頸處理九大原則 (精簡版)
瓶頸處理九大原則 (精簡版)瓶頸處理九大原則 (精簡版)
瓶頸處理九大原則 (精簡版)William Yeh
 
有了 Agile,為什麼還要有 DevOps?
有了 Agile,為什麼還要有 DevOps?有了 Agile,為什麼還要有 DevOps?
有了 Agile,為什麼還要有 DevOps?William Yeh
 
Immutable infrastructure:觀念與實作 (建議)
Immutable infrastructure:觀念與實作 (建議)Immutable infrastructure:觀念與實作 (建議)
Immutable infrastructure:觀念與實作 (建議)William Yeh
 
DevOps for Humans - Ansible for Drupal Deployment Victory!
DevOps for Humans - Ansible for Drupal Deployment Victory!DevOps for Humans - Ansible for Drupal Deployment Victory!
DevOps for Humans - Ansible for Drupal Deployment Victory!Jeff Geerling
 

Destacado (20)

Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
 
BIGDATA & HADOOP PROJECT
BIGDATA & HADOOP PROJECTBIGDATA & HADOOP PROJECT
BIGDATA & HADOOP PROJECT
 
Twitter, Big Data and Health
Twitter, Big Data and Health Twitter, Big Data and Health
Twitter, Big Data and Health
 
Cost savings and expert system advice with athene ES/1
Cost savings and expert system advice with athene ES/1 Cost savings and expert system advice with athene ES/1
Cost savings and expert system advice with athene ES/1
 
The Care + Feeding of a Mongodb Cluster
The Care + Feeding of a Mongodb ClusterThe Care + Feeding of a Mongodb Cluster
The Care + Feeding of a Mongodb Cluster
 
Hive - Apache hadoop Bigdata training by Desing Pathshala
Hive - Apache hadoop Bigdata training by Desing PathshalaHive - Apache hadoop Bigdata training by Desing Pathshala
Hive - Apache hadoop Bigdata training by Desing Pathshala
 
Hadoop Basics - Apache hadoop Bigdata training by Design Pathshala
Hadoop Basics - Apache hadoop Bigdata training by Design Pathshala Hadoop Basics - Apache hadoop Bigdata training by Design Pathshala
Hadoop Basics - Apache hadoop Bigdata training by Design Pathshala
 
Bio bigdata
Bio bigdata Bio bigdata
Bio bigdata
 
Recommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQLRecommendation and graph algorithms in Hadoop and SQL
Recommendation and graph algorithms in Hadoop and SQL
 
Dynamic Allocation in Spark
Dynamic Allocation in SparkDynamic Allocation in Spark
Dynamic Allocation in Spark
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
 
Bigdata Hadoop project payment gateway domain
Bigdata Hadoop project payment gateway domainBigdata Hadoop project payment gateway domain
Bigdata Hadoop project payment gateway domain
 
Ansible for Drupal infrastructure and deployments
Ansible for Drupal infrastructure and deploymentsAnsible for Drupal infrastructure and deployments
Ansible for Drupal infrastructure and deployments
 
Dynamic Resource Allocation Spark on YARN
Dynamic Resource Allocation Spark on YARNDynamic Resource Allocation Spark on YARN
Dynamic Resource Allocation Spark on YARN
 
HBase Sizing Guide
HBase Sizing GuideHBase Sizing Guide
HBase Sizing Guide
 
Monitoring 改造計畫:流程觀點
Monitoring 改造計畫:流程觀點Monitoring 改造計畫:流程觀點
Monitoring 改造計畫:流程觀點
 
瓶頸處理九大原則 (精簡版)
瓶頸處理九大原則 (精簡版)瓶頸處理九大原則 (精簡版)
瓶頸處理九大原則 (精簡版)
 
有了 Agile,為什麼還要有 DevOps?
有了 Agile,為什麼還要有 DevOps?有了 Agile,為什麼還要有 DevOps?
有了 Agile,為什麼還要有 DevOps?
 
Immutable infrastructure:觀念與實作 (建議)
Immutable infrastructure:觀念與實作 (建議)Immutable infrastructure:觀念與實作 (建議)
Immutable infrastructure:觀念與實作 (建議)
 
DevOps for Humans - Ansible for Drupal Deployment Victory!
DevOps for Humans - Ansible for Drupal Deployment Victory!DevOps for Humans - Ansible for Drupal Deployment Victory!
DevOps for Humans - Ansible for Drupal Deployment Victory!
 

Similar a Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity Planner

Ataas2016 - Big data hadoop and map reduce - new age tools for aid to test...
Ataas2016 - Big data   hadoop and map reduce  - new age tools for aid to test...Ataas2016 - Big data   hadoop and map reduce  - new age tools for aid to test...
Ataas2016 - Big data hadoop and map reduce - new age tools for aid to test...Agile Testing Alliance
 
Big Data - Hadoop and MapReduce - Aditya Garg
Big Data - Hadoop and MapReduce - Aditya GargBig Data - Hadoop and MapReduce - Aditya Garg
Big Data - Hadoop and MapReduce - Aditya GargAgile Testing Alliance
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewAbhishek Roy
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big AnalyticsAjay Ohri
 
How to build and run a big data platform in the 21st century
How to build and run a big data platform in the 21st centuryHow to build and run a big data platform in the 21st century
How to build and run a big data platform in the 21st centuryAli Dasdan
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reducePaladion Networks
 
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...Sarah Aerni
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...MLconf
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackAnant Corporation
 
Real time data processing frameworks
Real time data processing frameworksReal time data processing frameworks
Real time data processing frameworksIJDKP
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.pptSathish24111
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop EMC
 
re:Introduce Big Data and Hadoop Eco-system.
re:Introduce Big Data and Hadoop Eco-system.re:Introduce Big Data and Hadoop Eco-system.
re:Introduce Big Data and Hadoop Eco-system.Shakir Ali
 
re:Introduce Big Data and Hadoop Eco-system.
re:Introduce Big Data and Hadoop Eco-system.re:Introduce Big Data and Hadoop Eco-system.
re:Introduce Big Data and Hadoop Eco-system.Shakir Ali
 
How to add Artificial Intelligence Capabilities to Existing Software Platforms
How to add Artificial Intelligence Capabilities to Existing Software PlatformsHow to add Artificial Intelligence Capabilities to Existing Software Platforms
How to add Artificial Intelligence Capabilities to Existing Software PlatformsHarish Nalagandla
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopJosh Patterson
 
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopBig Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopHazelcast
 

Similar a Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity Planner (20)

Ataas2016 - Big data hadoop and map reduce - new age tools for aid to test...
Ataas2016 - Big data   hadoop and map reduce  - new age tools for aid to test...Ataas2016 - Big data   hadoop and map reduce  - new age tools for aid to test...
Ataas2016 - Big data hadoop and map reduce - new age tools for aid to test...
 
Big Data - Hadoop and MapReduce - Aditya Garg
Big Data - Hadoop and MapReduce - Aditya GargBig Data - Hadoop and MapReduce - Aditya Garg
Big Data - Hadoop and MapReduce - Aditya Garg
 
Big data Question bank.pdf
Big data Question bank.pdfBig data Question bank.pdf
Big data Question bank.pdf
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big Analytics
 
How to build and run a big data platform in the 21st century
How to build and run a big data platform in the 21st centuryHow to build and run a big data platform in the 21st century
How to build and run a big data platform in the 21st century
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reduce
 
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
 
Big Data
Big DataBig Data
Big Data
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data Stack
 
Real time data processing frameworks
Real time data processing frameworksReal time data processing frameworks
Real time data processing frameworks
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.ppt
 
Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop Analyst Report : The Enterprise Use of Hadoop
Analyst Report : The Enterprise Use of Hadoop
 
re:Introduce Big Data and Hadoop Eco-system.
re:Introduce Big Data and Hadoop Eco-system.re:Introduce Big Data and Hadoop Eco-system.
re:Introduce Big Data and Hadoop Eco-system.
 
re:Introduce Big Data and Hadoop Eco-system.
re:Introduce Big Data and Hadoop Eco-system.re:Introduce Big Data and Hadoop Eco-system.
re:Introduce Big Data and Hadoop Eco-system.
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
How to add Artificial Intelligence Capabilities to Existing Software Platforms
How to add Artificial Intelligence Capabilities to Existing Software PlatformsHow to add Artificial Intelligence Capabilities to Existing Software Platforms
How to add Artificial Intelligence Capabilities to Existing Software Platforms
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
 
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopBig Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
 

Último

Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 

Último (20)

Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 

Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity Planner

  • 1. moviri.com Hitchhiker’s guide for the Capacity Planner Connecticut Computer Measurement Group Connecticut Computer Measurement Group Cromwell CT – April 2015 Renato Bonomini renato.bonomini@moviri.com Capacity Management and BigData
  • 2. 2 Agenda ● Why as Capacity Planners do we need to be prepared? ● The Hadoop ecosystem - or better said the zoo ● Capacity Planning and Performance Tuning of Hadoop ● How to get started ● Measure measure measure
  • 3. Brought to you by… Renato Bonomini Lead of US operations for Moviri @renatobonomini Mattia Berlusconi Capacity Management Consultant Giulia Rumi Capacity Management Analyst Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved 3
  • 4. 4 Agenda ● Why as Capacity Planners do we need to be prepared? ● The Hadoop ecosystem - or better said the zoo ● Capacity Planning and Performance Tuning of Hadoop ● How to get started ● Measure measure measure
  • 5. Handling large amount of data?High Performance Computing? Is it new? Where does it come from? Why do I have to listen to this? 5 Cray 1, 80 MFLOPS, 1975 [A bunch of engineers on a field trip in Silicon Valley, Renato] IBM 350, 3.56 Mb, 1956 [Wikipedia] Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
  • 6. “When will computer hardware match the human brain?” Hans Moravec Robotics Institute Carnegie Mellon University The need for Analytics: the new “machine revolution” 6 http://www.transhumanist.com/volume1/moravec.htm Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
  • 7. February 20151964: Isaac Asimov on 2014 World’s Fair “The world of A.D. 2014 will have few routine jobs that cannot be done better by some machine than by any human being. Mankind will therefore have become largely a race of machine tenders.” “When will computer hardware match the human brain?” 7 http://reuvengorsht.com/2015/02/07/machines-replace-middle-management Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
  • 8. “Map and Reduce”“Divide et Impera” • Julius Caesar arrives in Alexandria after defeating the Egyptian army and enters the Ancient Library • Surprise: there are millions of copies in the library, how many of those are in latin? • Caesar arranges a Centuria (80 soldiers) to inspect each one a batch of books and report to their Centurion the number of pages written in Latin for their book • The Centurion writes on a tabula the count from each soldier; when finished he sums the part up All I need to know I learned from Rome 8 MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
  • 9. “Map and Reduce”Message Passing Interface Wow so “Map and Reduce” was a revolution? In one sense, which one? 9 MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat MPI tutorial Blaise Barney Lawrence Livermore National Laboratory C, Fortran Java, Python Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
  • 10. 1. MapReduce makes technologies available to a wide audience  We saw that MPI already handled similar use cases, but it was restricted mostly to University Research and large R&D facilities 2. Reliability and commodity hardware at its base 3. It moves the needle on how to handle large amount of data  Database: organize first, then load  Hadoop: load first, then organize What are the revolutions brought by MapReduce and BigData? 10Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
  • 11. 11 Agenda ● Why as Capacity Planners do we need to be prepared? ● The Hadoop ecosystem - or better said the zoo ● Capacity Planning and Performance Tuning of Hadoop ● How to get started ● Measure measure measure
  • 12. ● “Hardware” contains libraries and utilities, stores data, and supports jobs execution ● HDFS is the fault-tolerant, replicated distributed file-system ● YARN (Yet Another Resource Negotiator) includes several programming models that can co-exist in the cluster and MapReduce is only one of them ● The Application layer is composed of several frameworks, among which Pig and Hive are the most used. Hadoop workflow ● clients break data into small chunks to be loaded onto different data nodes ● for each datablock, client contacts namenode and it answer with a sorted list of 3 data nodes (every block is replicated in more than one machine) ● the client writes the blocks directly onto the datanode, the datanode replicates the data onto the two nodes The most famous open-source implementation of a MapReduce framework is Apache Hadoop 12 Optimization Techniques within the Hadoop Eco-system: a Survey Giulia Rumi, Claudia Colella, Danilo Ardagna Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved LAYERS HADOOP 1.X HADOOP 2.X Users Application layer Programming Models Resource Management File system Hardware Hive/Pig Hadoop 1.X MapReduce HDFS Hive/Pig HDFS YARN MapReduce
  • 13. Apache Hadoop Ecosystem 13 http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview Apache Hadoop documentation Between the Apache Hadoop Ecosystem and the NOSQL world, new applications are being developed every day https://hadoopecosystemtable.github.io http://nosql-database.org/select-the-right- database.html Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
  • 14. Geek Fun “A DBA walks into a NOSQL bar, but turns and leaves because he couldn't find a table” (webtonull)
  • 15. ● HDFS (Hadoop distributed filesystem) is where Hadoop cluster stores data ● YARN is the architectural center of Hadoop that allows multiple data processing engines ● MapReduce is a programming paradigm ● Hive provides a warehouse structure and SQL-like access for data in HDFS ● Pig A high-level data-flow language ● Hbase is an open-source, distributed, versioned, column-oriented store that sits on top of HDFS. • Apache Spark is an open source big data real time processing framework • ZooKeeper is an open source Apache project that provides a centralized infrastructure and services that enable synchronization across a cluster • Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers • Solr is an opensource enterprise search platform from the Apache Lucene project. It provides full-text search, hit highlighting, faceted search, dynamic clustering, database integration and rich document handling. We are going to focus on a few specific “animals” of this zoo 15Capacity Management of BigData - Moviri (C) 2015 - All Rights Reserved
  • 16. Agenda
  ● Why do we, as Capacity Planners, need to be prepared?
  ● The Hadoop ecosystem - or, better said, the zoo
  ● Capacity Planning and Performance Tuning of Hadoop
  ● How to get started
  ● Measure, measure, measure
  • 17. Why do you need to get on board soon? There are significant resources and areas of improvement
  "we'll start the new Hadoop cluster with 500 TB and then we'll see how much we need" - real conversation at a customer
  ● Significant investments are being directed towards these initiatives
  ● They are complex and large, with hundreds of configuration parameters: a little help from an experienced capacity planner can save a lot of money
  • 18. Role of the Capacity Planner and Performance Analyst
  ● Shouldn't the 'Hadoop user/owner' take care of this?
    – Distributed machine learning is still an active research topic; it is related to both machine learning and systems
    – While Hadoop users don't develop systems, they need to know how to choose systems. An important fact is that existing distributed systems and parallel frameworks are not particularly designed for machine learning algorithms
  ● Hadoop users can
    – help to affect how systems are designed
    – design new algorithms for existing systems
  • 19. Vendor guidelines – if you are in a hurry you can stop here
  ● http://blog.cloudera.com/blog/2010/08/hadoophbase-capacity-planning
  ● http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-hadoop-cluster
  ● http://hortonworks.com/cluster-sizing-guide
  ● http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.7/bk_cluster-planning-guide/content/ch_hardware-recommendations.html
  • 20. Current performance tuning opportunities: Scheduling
  ● Scheduling is one of the most important tasks in a multi-concurrent-task system: see the research from our colleague Giulia (and others), "Optimization Techniques within the Hadoop Eco-system: a Survey" [DOI: 10.1109/SYNASC.2014.65]
  ● It illustrates the typical optimization problems:
    – data locality
    – sticky-slot problems
    – poor system utilization because of suboptimal distribution of tasks
    – unbalanced jobs
    – starvation, and even fairness (be fair to your users)
  ● There are hundreds of configuration variables available to the end user: rule of thumb vs. optimal configuration can make a big difference
  • 21. How are other configuration opportunities being pursued?
  ● Other initiatives
    – Starfish http://www.cs.duke.edu/starfish/index.html [Towards Automatic Optimization of MapReduce Programs, Shivnath Babu, Duke University, Durham, North Carolina, USA, shivnath@cs.duke.edu]
    – Research from Dominique A. Heger of DHT [Workload Dependent Hadoop MapReduce Application Performance Modeling]
  ● The common result of most research initiatives is "one size does not fit all"
    – Example for classic MapReduce: there is no single behavior; you have to know your workload characterization
    – "Hortonworks recommends that you either use the Balanced workload configuration or invest in a pilot Hadoop cluster and plan to evolve as you analyze the workload patterns in your environment" http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.7/bk_cluster-planning-guide/content/typical-workloads.html
  • 22. Profiling your workload
  ● You want to know what the limiting factor of each workload is
  ● Examples are
    – CPU performance
    – Disk I/O
    – Memory (bandwidth and latency)
    – Network (bandwidth, delay, packet loss)
    – Storage space
  ● This is nothing new for the wise Capacity Planner!
  Courtesy of Intel
  • 23. Agenda
  ● Why do we, as Capacity Planners, need to be prepared?
  ● The Hadoop ecosystem - or, better said, the zoo
  ● Capacity Planning and Performance Tuning of Hadoop
  ● How to get started
  ● Measure, measure, measure
  • 24. Hadoop is a "zoo" of several different applications
  Different points of view for analysis
  • Interest in fast response for "interactive workloads"
    – CPU, memory, network, and IO utilization levels to respond to queries in a quick and effective way
  • Interest in high throughput for "batch workloads"
    – Maximize the utilization levels; not interested in response time
  • Interest in storage capacity
    – Understand and plan the file system and HDFS
  Different types of workload
  • Most companies are simply using Hadoop to store information (HDFS) for big data sets
  • Vendors incorporate many other components: HDFS, Hive, Spark, Solr, Flume, etc.
  • For example, there are significant differences between Hadoop and HBase workloads
    – Hadoop MapReduce is a framework to process large sets of data, using distributed and parallel algorithms
    – HBase is much better for real-time read/write/modify access to tabular data
  • 25. Get your feet wet!
  3 standard types of analyses: we'll check what's underneath each component to file it under 3 simple analyses we are all friends with:
  a. interactive workload > you are interested in a good response time
  b. batch workloads > you are interested in maximizing utilization, optimal concurrency, and the best volume/duration ratio
  c. storage > used/free space
  For each component, let's make a summary of
  • how it works, so that we can focus on the type of workload
  • what the bottlenecks could be, in the order we usually find them
  • which technique (a), (b), or (c) could apply
  • which similar 'traditional' technology could be used as an analogy
  • 26. Online vs streaming vs batch – frame the problem as you already know
  http://www.hadoop360.com/blog/batch-vs-real-time-data-processing
  • 27. HDFS - Hadoop Distributed File System
  What it is
  • where a Hadoop cluster stores data; functions include
    – storage of the files' metadata, overseeing the health of DataNodes, coordination of access to data
  • 2 main components
    – NameNode: the master of HDFS, memory and I/O intensive
    – DataNode: manages the storage attached to the nodes
  HDFS is an append-only file system; it does not allow data modification
  How to get started
  • HDFS is a write-once, read-many (or WORM-ish) file system: you can only append to a file - it keeps growing and growing!
  • NameNode
    – Monitor the disk space available to the NameNode (local, or remote when diversified storage is used for resilience, as recommended)
  • DataNode
    – IO is important
    – disk space is another dimension
  • 28. HDFS - Hadoop Distributed File System/2
  How to get started/2
  • Capacity analysis approach:
    – (a), (b) or (c)
  • Similar technology
    – at a high level, manage it as any logical storage device
  Bottlenecks
  • Disk IO (volume of IOps and response time)
  • Network bandwidth
  • Storage space [you need about 4x the raw size of the data you will store in HDFS. However, on average we have seen compression ratios of up to 10-20x for text files stored in HDFS, so the actual raw disk space required is only about 30-50% of the original uncompressed size - see the sizing sketch below]
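  To make the rule of thumb concrete, here is a small sizing sketch; the replication factor, scratch-space overhead, and compression ratio are assumptions to be replaced with measured values from your own cluster.

    # Rough HDFS sizing: 3 replicas plus ~1x working space gives the "4x"
    # rule; compression shrinks text data before it is ever replicated.
    def raw_disk_needed_tb(dataset_tb, replication=3, scratch=1.0,
                           compression_ratio=10.0):
        compressed = dataset_tb / compression_ratio   # e.g. 10-20x for text
        return compressed * (replication + scratch)   # ~4x per stored TB

    # 500 TB of uncompressed text -> about 200 TB of raw disk (40%)
    print(raw_disk_needed_tb(500))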
  • 29. YARN - Yet Another Resource Negotiator
  What it is
  • YARN is the architectural center of Hadoop: it allows multiple data processing engines such as interactive SQL, real-time streaming, data science, and batch processing to handle data stored in a single platform
  How to get started
  • Bottlenecks
    – for every component
      • Disk IO
      • Network
    – for the NodeManager (slave)
      • CPU
  • Capacity analysis approach:
    – (b)
  • Similar technology
    – job scheduler
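  For approach (b), the ResourceManager's REST API is the quickest source of cluster-wide numbers; a minimal sketch (host is an assumption; 8088 is the usual default port):

    import requests

    RM = "http://resourcemanager.example.com:8088"  # assumed RM address

    m = requests.get(RM + "/ws/v1/cluster/metrics").json()["clusterMetrics"]
    # Queue pressure and container usage are the batch-workload signals:
    print("apps pending:        ", m["appsPending"])
    print("apps running:        ", m["appsRunning"])
    print("containers allocated:", m["containersAllocated"])
    print("active nodes:        ", m["activeNodes"])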
  • 30. Map&Reduce
  What it is
  • Remember: it is a programming paradigm, not a standalone application. It mainly consists of two phases:
    – In the Map phase, the main work is reading data blocks and splitting them into Map tasks processed in parallel. The intermediate results are stored temporarily in memory and on disk
    – The Reduce phase concentrates the output for each key into the same Reduce task, processes it, and emits the final result
  How to get started
  • Bottlenecks
    – JVM memory metrics
    – Very much workload dependent! You have to profile your application
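  The canonical illustration of the two phases is WordCount; a minimal, runnable sketch for Hadoop Streaming (the script name and paths are illustrative):

    # wordcount.py - run as "wordcount.py map" or "wordcount.py reduce"
    import sys

    def mapper(stream):
        # Map phase: read the input split, emit (word, 1) pairs
        for line in stream:
            for word in line.split():
                print("%s\t1" % word)

    def reducer(stream):
        # Reduce phase: Hadoop sorts mapper output by key, so equal
        # words arrive together; sum the counts per word
        current, total = None, 0
        for line in stream:
            word, count = line.rsplit("\t", 1)
            if word != current:
                if current is not None:
                    print("%s\t%d" % (current, total))
                current, total = word, 0
            total += int(count)
        if current is not None:
            print("%s\t%d" % (current, total))

    if __name__ == "__main__":
        (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)

  You would launch it with something like hadoop jar hadoop-streaming.jar -file wordcount.py -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" -input /in -output /out (paths assumed), and watch the JVM memory metrics called out above while it runs.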
  • 31. Pig & Hive
  What they are
  • Hive provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3). Hive's query language, HiveQL, compiles to MapReduce
  • Pig is a high-level language for writing queries over large datasets. A query planner compiles queries written in this language (called "Pig Latin") into maps and reduces, which are then executed on a Hadoop cluster. Pig's main features are ease of programming, optimization opportunities, customization, and extensibility
  How to get started
  • possible bottlenecks
    – Memory
    – Disk IO
    – Network
  • Capacity analysis approach
    – (a) or (b)
  • Similar technology
    – data warehouse
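  Since HiveQL compiles down to MapReduce, a single statement can fan out into several batch jobs; a hedged sketch driving the hive CLI's -e flag from Python - the table and column names are purely illustrative:

    import subprocess

    # Assumed table over files in HDFS; each statement compiles into
    # one or more MapReduce jobs, so profile it as a batch (b) workload.
    query = ("SELECT page, COUNT(*) AS hits "
             "FROM access_logs GROUP BY page "
             "ORDER BY hits DESC LIMIT 10")
    subprocess.run(["hive", "-e", query], check=True)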
  • 32. HBase
  What it is
  • HBase is column-based rather than row-based, which enables high-speed execution of operations performed over similar values across massive data sets
  • HBase runs directly on top of HDFS
  • It scales linearly by requiring all tables to have a primary key. The key space is divided into sequential blocks that are then allotted to a region. RegionServers own one or more regions, so the load is spread uniformly across the cluster. HBase can further subdivide a region by splitting it automatically, so manual data sharding is not necessary
  How to get started
  • possible bottlenecks
    – Memory (be careful of swapping, JVM memory metrics, and GC; GC pauses longer than 60 seconds can cause a RegionServer to go offline)
    – Disk IO (in case data is spooled to disk)
    – Network (latency)
  • Capacity analysis approach
    – (a)
  • Similar technology
    – distributed DBMS
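  A sketch of the row-key-centric access pattern using the third-party happybase Thrift client (host, table, and column family are assumptions; the HBase Thrift server must be running):

    import happybase

    conn = happybase.Connection("hbase-thrift.example.com")  # assumed host
    table = conn.table("metrics")                            # assumed table

    # Single-row put/get by key: per-operation latency is what makes
    # HBase a type (a), interactive workload.
    table.put(b"host01-20150401", {b"cf:cpu_util": b"0.73"})
    print(table.row(b"host01-20150401"))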
  • 33. Agenda
  ● Why do we, as Capacity Planners, need to be prepared?
  ● The Hadoop ecosystem - or, better said, the zoo
  ● Capacity Planning and Performance Tuning of Hadoop
  ● How to get started
  ● Measure, measure, measure
  • 34. It all sounds good so far, but which metrics do I need? Laundry list – generic metrics
  • CPU
    – utilization (user/sys/wio)
    – load
  • Memory
    – utilization
    – used (cached, user, sys)
    – swap in/out
  • Disk IO
    – read/write ops rate
    – read/write byte rate
  • Network
    – sent/received packets and bits
  • Garbage Collection
    – collection count and time
    – overhead (percentage of time spent in GC), very important
  • Heap memory
    – size, used
    – used after GC (much more valuable; you can correlate it with workload)
    – Perm Gen/Code Cache/Eden Space 'used'
    – PS Old/Perm Gen 'used'
    – Tenured Gen 'used'
    – PS Eden/Survivor/PS Survivor Space 'used'
  • JVM threads
    – count
    – daemon count
  • JVM files
    – JVM open/max open files
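  A minimal collector for the OS-level half of this list, sketched with the third-party psutil library (JVM metrics come from JMX instead; see the next slide). In practice you would sample at a fixed interval and ship the values to your capacity database:

    import os
    import psutil  # third-party cross-platform metrics library

    sample = {
        "cpu_util": psutil.cpu_percent(interval=1),          # %, user+sys
        "load_avg": os.getloadavg()[0],                      # 1-minute load (Unix)
        "mem_used": psutil.virtual_memory().used,            # bytes
        "swap_out": psutil.swap_memory().sout,               # bytes swapped out
        "disk_reads": psutil.disk_io_counters().read_count,  # cumulative ops
        "net_sent": psutil.net_io_counters().bytes_sent,     # cumulative bytes
    }
    print(sample)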
  • 35. It all sounds good so far, but which metrics do I need? Laundry list – specific metrics
  • HDFS NameNode
    – storage: total and used capacity
    – files created/total/deleted
  • HDFS DataNode
    – fs: bytes read/written
    – fs: reads/writes from local/remote clients
    – "map reduce blocks": volume of blocks read/written/removed/replicated/verified
    – "map reduce block operations": copy/read/replace/write, avg time/volume
  • YARN ResourceManager
    – active/decommissioned/unhealthy NodeManagers
    – active applications/users
    – applications submitted, completed, failed, killed
    – applications pending, running
    – containers allocated/released/pending/reserved
  • HBase
    – requests (total/read/write)
    – memstore size, upper limit
    – flush queue length
    – compaction queue length
  • ZooKeeper
    – sent/received packets
    – request latency
    – outstanding requests
    – JVM pool size
  • Solr
    – request rate/latency
    – JVM pool size
    – added docs rate
    – query result cache size, hit %, response time
    – document cache size
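  Most of these Hadoop-specific counters are exposed as JSON by each daemon's /jmx servlet; a sketch against the NameNode (host and port are assumptions; 50070 is the 2.x default):

    import requests

    NN = "http://namenode.example.com:50070"   # assumed NameNode address

    beans = requests.get(
        NN + "/jmx",
        params={"qry": "Hadoop:service=NameNode,name=FSNamesystem"},
    ).json()["beans"][0]
    print("capacity used :", beans["CapacityUsed"])
    print("capacity total:", beans["CapacityTotal"])
    print("files total   :", beans["FilesTotal"])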
  • 36. Headquarters: Via Schiaffino 11C, 20158 Milan MI, Italy - T +39-024951-7001
  USA East: One Boston Place, Floor 26, Boston, MA 02108, USA - T +1-617-936-0212
  USA West: 425 Broadway Street, Redwood City, CA 94063, USA - T +1-650-226-4274
  moviri.com
  • 37. Spark
  ● Spark lets you quickly write applications in Java, Scala, or Python. It comes with a built-in set of over 80 high-level operators, and you can use it interactively to query data within the shell. It is a comprehensive, unified framework to manage big data processing requirements, with a variety of data sets that are diverse in nature. In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning, and graph data processing. Developers can use these capabilities stand-alone or combine them to run in a single data pipeline use case.
  ● Features
    – everything is in memory
    – data is held in memory as RDDs (Resilient Distributed Datasets)
    – best for cyclic jobs (performance up to 100 times better than Hadoop MapReduce)
  ● possible bottlenecks:
    – Memory
    – Network + Disk IO (remote/local files)
    – CPU
  ● Capacity analysis approach: (a) or (b), depending on the workload
  ● Similar technology: similar to the generic Hadoop MapReduce case
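  The same WordCount from the Map&Reduce slide, as a PySpark sketch (input/output paths are assumptions); the intermediate RDDs stay in memory, which is where the speedup for iterative jobs comes from:

    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount-sketch")

    # Each transformation builds a new in-memory RDD; nothing is spilled
    # to HDFS between the "map" and "reduce" steps.
    counts = (sc.textFile("hdfs:///user/demo/sample.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("hdfs:///user/demo/counts")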
  • 38. ZooKeeper
  ● Applications can leverage these services to coordinate distributed processing across large clusters. A very large Hadoop cluster can be supported by multiple ZooKeeper servers.
  ● Each client machine communicates with one of the ZooKeeper servers to retrieve and update its synchronization information. Often, network and memory problems manifest themselves first in ZooKeeper
  ● possible bottlenecks:
    – CPU wio
    – Memory (JVM latency; GC pauses longer than 60 seconds can cause RegionServers to go offline)
    – Network (latency)
  ● Capacity analysis approach: (a)
  ● Similar technology: in-memory database
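  ZooKeeper answers four-letter admin commands over its client port; 'mntr' dumps the latency and queue metrics listed earlier. A sketch (hostname is an assumption; 2181 is the default port):

    import socket

    s = socket.create_connection(("zookeeper.example.com", 2181))
    s.sendall(b"mntr")
    data = s.makefile().read()   # tab-separated key/value lines
    s.close()

    stats = dict(line.split("\t", 1) for line in data.splitlines() if "\t" in line)
    print("avg latency         :", stats.get("zk_avg_latency"))
    print("outstanding requests:", stats.get("zk_outstanding_requests"))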
  • 39. Cassandra
  ● Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low-latency operations for all clients
  ● possible bottlenecks:
    – Memory
    – Disk IO
    – Network
  ● Capacity analysis approach: (a) and (c)
  ● Similar technology: distributed DBMS
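  A read sketch with the DataStax Python driver (contact points, keyspace, and schema are illustrative); with no master in the path, per-query latency (a) is the thing to watch:

    from cassandra.cluster import Cluster  # DataStax python driver

    cluster = Cluster(["cass1.example.com", "cass2.example.com"])  # assumed nodes
    session = cluster.connect("metrics")                           # assumed keyspace

    # The driver routes each request straight to a replica node.
    for row in session.execute(
            "SELECT cpu_util FROM samples WHERE host = %s LIMIT 1", ("host01",)):
        print(row.cpu_util)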
  • 40. Solr
  ● Solr is highly scalable: it provides distributed search and index replication. It is one of the most popular enterprise search platforms
  ● possible bottlenecks:
    – Memory (at the JVM level)
    – CPU
    – Disk IO
  ● Capacity analysis approach: (a)
  ● Similar technology: distributed DBMS
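  Query-side latency is reported by Solr itself in every response; a sketch against the standard /select handler (host and core name are assumptions):

    import requests

    SOLR = "http://solr.example.com:8983/solr/logs"   # assumed core

    resp = requests.get(SOLR + "/select",
                        params={"q": "level:ERROR", "rows": 5, "wt": "json"}).json()
    print("hits      :", resp["response"]["numFound"])
    print("QTime (ms):", resp["responseHeader"]["QTime"])  # per-query latency -> type (a)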

Editor's notes

  1. Abstract: Hadoop is a zoo of different types of workloads; even if most companies are simply using Hadoop to store information (HDFS), there are many other applications - to name a few: HDFS, Hive, Pig, Impala, Spark, Solr, Flume. Each animal in this zoo behaves differently; for example, there are significant differences between the two most common workloads, "MapReduce" and "HBase". This leads to three main points of view for analysis to make sure service levels are achieved: interest in response time for "interactive workloads" (CPU, memory, network, and IO utilization levels to respond to queries in a quick and effective way); interest in high throughput for "batch workloads" (maximize the utilization levels, not interested in response time); interest in planning storage capacity (filesystem and HDFS). This talk focuses on providing guidelines for the capacity planner to understand how to translate existing techniques and frameworks and adapt them to these new technologies: in most cases "what's old is new again".
  2. Renato Bonomini, Lead of US operations for Moviri - engineer at heart and by training at Politecnico of Milan; my specialties are Digital Signal Processing and, in a previous life, High Performance Computing. Now I help companies achieve alignment between Business and IT using optimization techniques, Capacity Management and Performance Management. Mattia Berlusconi holds a degree in IT Engineering from the Politecnico of Milan. In 2012 he joined the Consulting Department of Moviri, working in the Capacity Management Business Unit as an IT Performance Optimization consultant. He participates in national and international projects focused on designing and implementing capacity management solutions to allow customers to effectively manage the capacity of their on-premises and cloud IT environments. He likes photography and hiking mountains. Giulia Rumi is a member of the Moviri Capacity Management Team. Giulia holds an MS degree in Computer Engineering from Politecnico of Milan, with a thesis focused on energy consumption in mobile devices. She joined Moviri straight out of college in 2015. She plays piano and likes sweets and comics.
  3. Cray 1: 80 MFLOPS. Neptuny/Moviri field trip at the computer museum in San Jose, in front of the Cray 1. IBM 350: 3.56 Mb. Today: the iPhone 5s GPU delivers 76.8 GFLOPS, with 16 GB of storage.
  4. Applications: Predictive Analytics, Machine Learning A few applications or analytics that revolutionized the way we live - Moneyball - Recommendation engines analytics: the “new machine revolution” http://www.transhumanist.com/volume1/moravec.htm “When will computer hardware match the human brain?”
  5. In 1964, Isaac Asimov wrote about a visit to the World's Fair of 2014: "The world of A.D. 2014 will have few routine jobs that cannot be done better by some machine than by any human being. Mankind will therefore have become largely a race of machine tenders." Bad news for humankind: it has already happened - think of Uber and other companies delegating middle management to algorithms. http://reuvengorsht.com/2015/02/07/machines-replace-middle-management
  6. Early example of Map Reduce process – or late example of divide et impera principle? Compare Map and Reduce functions between the Caesar example and the WordCount example typical of M&R
  7. Compare MPI's Broadcast, Gather, Scatter, and Reduce with Map & Reduce's Map and Reduce. A huge difference is how modern and enterprise-ready M&R is: ask a young developer to code in F77 (not the only choice, but a common one for MPI) or even in C, and compare that to the ease of development in, for example, Java.
  8. 3 important points of why M&R made a difference that MPI could not popularize
  9. the MapReduce principle http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf what is hadoop in the latest version http://www.slideshare.net/cloudera/introduction-to-yarn-and-mapreduce-2 diagram to illustrate architecture ‘hadoop architecture’ [Giulia’s paper and http://wiki.apache.org/hadoop/PoweredByYarn ] HDFS -> YARN -> {MR2, Impala, Spark, Hbase, MPI, hive, pig}
  10. https://hadoopecosystemtable.github.io http://nosql-database.org/select-the-right-database.html http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview
  11. What we are focusing on - top list (HDFS, YARN, MapReduce, Hive/Pig, HBase): HDFS (Hadoop Distributed File System) is where a Hadoop cluster stores data. YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform. MapReduce is a programming paradigm. Hive provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3). Pig is a high-level data-flow language and execution framework for parallel computation. HBase is an open-source, distributed, versioned, column-oriented store that sits on top of HDFS. Other interesting ones: Solr, Spark, ZooKeeper, Impala, Cassandra. Apache Spark is an open-source big data real-time processing framework built around speed, ease of use, and sophisticated analytics. ZooKeeper is an open-source Apache project that provides a centralized infrastructure and services that enable synchronization across a cluster. It maintains data like configuration information, hierarchical naming space, and so on. Apache Cassandra is an open-source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Solr is an open-source enterprise search platform from the Apache Lucene project. It provides full-text search, hit highlighting, faceted search, dynamic clustering, database integration and rich document handling.
  12. Why do you need to get on board soon? Conversation heard at a company: "we'll start the new Hadoop cluster with 500 TB and then we'll see how much we need". There are significant amounts of resources ($$$) involved in these infrastructures. As a capacity planner, don't miss the boat!
  13. Role of the Capacity Planner and Performance Analyst. Shouldn't the 'Hadoop user/owner' take care of this? Distributed machine learning is still an active research topic; it is related to both machine learning and systems. While Hadoop users don't develop systems, they need to know how to choose systems. An important fact is that existing distributed systems or parallel frameworks are not particularly designed for machine learning algorithms; Hadoop users can help to affect how systems are designed, and design new algorithms for existing systems.
  14. what’s available as guidelines (http://blog.cloudera.com/blog/2010/08/hadoophbase-capacity-planning and http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-hadoop-cluster ) starting point planner: http://hortonworks.com/cluster-sizing-guide/ http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.7/bk_cluster-planning-guide/content/ch_hardware-recommendations.html
  15. Current performance issues - research being developed; see "Optimization Techniques within the Hadoop Eco-system: A Survey", DOI: 10.1109/SYNASC.2014.65. Scheduling performance: scheduling is one of the most important tasks in a multi-concurrent-task system; the paper from our colleague Giulia (and others) shows the typical optimization problems: data locality, sticky-slot problems, poor system utilization because of suboptimal distribution of tasks, unbalanced jobs, starvation, and even fairness (be fair to your users).
  16. Others pushing the envelope with existing initiatives: Starfish http://www.cs.duke.edu/starfish/index.html [Towards Automatic Optimization of MapReduce Programs, Shivnath Babu, Duke University, Durham, North Carolina, USA, shivnath@cs.duke.edu]; documents from DHT [Workload Dependent Hadoop MapReduce Application Performance Modeling], Dominique A. Heger, www.cmg.org/wp-content/uploads/2013/07/m_101_61.pdf. "One size does not fit all" - example for classic MapReduce: there is not a single behaviour; you have to know your workload characterization. "Hortonworks recommends that you either use the Balanced workload configuration or invest in a pilot Hadoop cluster and plan to evolve as you analyze the workload patterns in your environment." http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.7/bk_cluster-planning-guide/content/typical-workloads.html
  17. 3 standard types of analyses: we'll check what's underneath each component to file it under 3 simple analyses we are all friends with: (a) interactive workload > you are interested in a good response time; (b) batch workloads > you are interested in maximizing utilization, optimal concurrency and the best volume/duration ratio; (c) storage > used/free space. For each component, let's try to make a summary of: how it works, so that we can focus on the type of workload; what the bottlenecks could be, in the order we usually find them; which technique (a), (b) or (c) could apply; which similar 'traditional' technology could be used as an analogy.
  18. http://www.hadoop360.com/blog/batch-vs-real-time-data-processing
  19. 2 main components. NameNode: it is the master of HDFS that directs the DataNode daemons to perform the low-level I/O tasks; it maps the blocks onto the DataNodes. It was a single point of failure for HDFS in MR1; it is no longer a single point of failure from v2 (MapR has developed a "distributed NameNode," where the HDFS metadata is distributed across the cluster in "Containers"). The function of the NameNode is memory and I/O intensive - memory hungry! Monitor the JVM heap size; monitor the disk space available to the NameNode (local, or remote when diversified storage is used for resilience, as recommended). DataNode: usually there is a DataNode for each node in the cluster; it manages the storage attached to the nodes. IO is important; disk space is another dimension. HDFS is an append-only file system; it does not allow data modification. HDFS is a write-once, read-many (or WORM-ish) filesystem: once a file is created, the filesystem API only allows you to append to the file, not to overwrite it >> it keeps growing and growing!
  20. possible bottlenecks: Disk IO (volume of IOps and response time) Network bandwidth Storage Capacity analysis approach: (a),(b) or (c) Similar technology: high level, manage it as any logical storage device
  21. YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform. YARN also extends the power of Hadoop to incumbent and new technologies found within the data center so that they can take advantage of cost-effective, linear-scale storage and processing. YARN's original purpose was to split up the two major responsibilities of the JobTracker/TaskTracker into separate entities: a global ResourceManager, a per-application ApplicationMaster, a per-node slave NodeManager, and a per-application Container running on a NodeManager. The ResourceManager and the NodeManager formed the new generic system for managing applications in a distributed manner. The ResourceManager is the ultimate authority that arbitrates resources among all applications in the system. The ApplicationMaster is a framework-specific entity that negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the component tasks. The ResourceManager has a scheduler, which is responsible for allocating resources to the various applications running in the cluster, according to constraints such as queue capacities and user limits. The scheduler schedules based on the resource requirements of each application. Each ApplicationMaster has responsibility for negotiating appropriate resource containers from the scheduler, tracking their status, and monitoring their progress. From the system perspective, the ApplicationMaster runs as a normal container. The NodeManager is the per-machine slave, which is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager. NodeManager and DataNode run together on the same machine.
  22. Hive provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3). Hive's query language, HiveQL, compiles to MapReduce. It also allows user-defined functions (UDFs). Hive is widely used, and has itself become a "sub-platform" in the Hadoop ecosystem. It is best suited for batch jobs over large sets of append-only data; Apache Hive automatically manages the compilation, optimization, and execution of a HiveQL statement. Pig is a high-level language for writing queries over large datasets. A query planner compiles queries written in this language (called "Pig Latin") into maps and reduces, which are then executed on a Hadoop cluster. Pig's main features are: ease of programming, optimization opportunities, customization, extensibility. It abstracts the procedural style of MapReduce in the direction of the declarative style of SQL. A Pig program generally goes through three steps: load, transform, and store. First, the data on which the program has to work is loaded (in Hadoop the objects are stored in HDFS); then a set of transformations is applied to the loaded data, and the mappers and reducers are handled transparently to the user; finally, if needed, the results are stored in a local file or in HDFS. Possible bottlenecks: Memory, Disk IO, Network. Capacity analysis approach: (a) or (b). Similar technology: data warehouse.
  23. HBase is column-based rather than row-based, which enables high-speed execution of operations performed over similar values across massive data sets, e.g. read/write operations that involve all rows but only a small subset of all columns. HBase directly runs on top of HDFS HBase scales linearly by requiring all tables to have a primary key. The key space is divided into sequential blocks that are then allotted to a region. RegionServers own one or more regions, so the load is spread uniformly across the cluster. If the keys within a region are frequently accessed, HBase can further subdivide the region by splitting it automatically, so that manual data sharding is not necessary. possible bottlenecks: Memory (be careful of swapping) - It is recommended to discourage swapping on HBase nodes and to enable GC logging to look for large GC pauses in the log. GC pauses longer than 60 seconds can cause RS to go offline Disk IO (in case data is spooled to disk) Network (latency) Capacity analysis approach: (a) Similar technology: distributed DBMS
  24. Spark lets you quickly write applications in Java, Scala, or Python. It comes with a built-in set of over 80 high-level operators, and you can use it interactively to query data within the shell. It is a comprehensive, unified framework to manage big data processing requirements with a variety of data sets that are diverse in nature. In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning and graph data processing. Developers can use these capabilities stand-alone or combine them to run in a single data pipeline use case. Everything is in memory; data is held in memory as RDDs (Resilient Distributed Datasets); best for cyclic jobs (performance up to 100 times better than Hadoop MapReduce). Possible bottlenecks: Memory, Network + Disk IO (remote/local files), CPU. Capacity analysis approach: (a) or (b) depending on the workload. Similar technology: similar to the generic Hadoop MapReduce case.