www.vitech.com.ua
Agenda
● Hadoop causes real big data industry changes
● What technology is behind this name?
● Why is Hadoop such a promising solution?
BIG DATA APPROACH
HADOOP ENVIRONMENT
THE INDUSTRY FACE IS CHANGING
What is BIG DATA?
● Really BIG DATA things: photo banks, video storage, historical measurements.
● Intensive data transactions and high distribution: stores (offline or online), banks, advertising networks.
● Real-time data: measurements and monitoring, gaming.
● Intensive processing: science, modelling.
● High volumes of small things: social networks, healthcare.
BIG DATA IS EVERYWHERE
The WORLD is big data itself
Yet to remember...
THE WORLD ITSELF CAN BE DIGITIZED TOO
● Earth's weather and environment: real-time, really big data volume, high potential for processing, lots of things to be analysed, historical data.
● Space: unlimited potential for analysis; the ocean is a yet unknown volume.
● The Internet of Things is going to be a digital world itself.
● ???
BIG DATA storage: requirements
SIMPLE BUT RELIABLE
● A really big amount of data is to be stored in a reliable manner.
● Storage is to be simple, recoverable and cheap.
BIG DATA storage: requirements
DECENTRALIZED
● No single point of failure.
● Scalable, as close to linear as possible.
● No manual actions to recover in case of failures.
BIG DATA processing: requirements
SIMPLE TO USE
● Complexity is to be buried inside.
● The interface is to be functional and compatible between versions.
BIG DATA processing: requirements
TOOLS TO BE CLOSE TO THE WORK
● Process data on the same nodes it is stored on.
● Distributed storage means distributed processing.
BIG DATA processing: requirements
SHARE THE LOAD
● Work is to be balanced.
● Data placement is to support balanced work.
● The amount of work is to be balanced in accordance with resources.
Solution requirements in general
WHAT DO WE FINALLY NEED?
● CPU+HDD in one place
● A cluster of replaceable nodes
● Lots of storage space
● A way to control resources and balance load
● Everything is to be relatively simple and affordable
What is HADOOP?
● Hadoop is an open source framework for big data: both distributed storage and processing.
● Hadoop is reliable and fault tolerant without relying on hardware for these properties.
● Hadoop has unique horizontal scalability: currently from a single computer up to thousands of cluster nodes.
Facts and trends
● 2004: inspired by the Google MapReduce idea. Hadoop was originally named after its creator's son's toy elephant.
● On June 13, 2012, Facebook announced that their Hadoop cluster held 100 PB of data. On November 8, 2012, they announced the warehouse was growing by roughly half a PB per day.
● On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores.
Hadoop: classical picture
Hadoop historical top view
● HDFS serves as the file system layer.
● MapReduce originally served as the distributed processing framework.
● The native client API is Java, but there are lots of alternatives.
● This is only the initial architecture; it is now more complex.
HDFS top view
● The namenode is the 'management' component. It keeps a 'directory' of which file blocks are stored where.
● The actual work is performed by the data nodes.
HDFS file handling
● Files are stored in large blocks. Every block is replicated to several data nodes.
● Replication is tracked by the namenode. Clients only locate blocks via the namenode; the actual load is taken by the datanodes.
● A datanode failure triggers replication recovery. The namenode can be backed by a standby scheme.
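The namenode/datanode split can be sketched with a toy Python model (illustrative only; class and node names are invented here, and this is not the real HDFS API): the namenode holds only the block-to-datanode mapping, and a datanode failure triggers re-replication onto the surviving nodes.

```python
import random

class ToyNameNode:
    """Toy model of the HDFS namenode: tracks block placement only."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = set(datanodes)
        self.replication = replication
        self.block_map = {}                  # block id -> set of datanodes

    def add_block(self, block_id):
        # Place replicas on distinct datanodes.
        targets = random.sample(sorted(self.datanodes), self.replication)
        self.block_map[block_id] = set(targets)

    def locate(self, block_id):
        # Clients only ask the namenode *where* a block lives;
        # the bytes themselves are read from a datanode.
        return self.block_map[block_id]

    def datanode_failed(self, node):
        # Replication recovery: re-create lost replicas elsewhere.
        self.datanodes.discard(node)
        for holders in self.block_map.values():
            holders.discard(node)
            while len(holders) < self.replication:
                holders.add(random.choice(sorted(self.datanodes - holders)))

nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"], replication=3)
nn.add_block("blk_0001")
nn.datanode_failed("dn1")
print(len(nn.locate("blk_0001")))   # → 3 (replication restored)
```

The real namenode also persists an edit log and receives block reports from datanodes; the sketch keeps only the placement logic.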
HDFS properties
HDFS is ...
● Designed for throughput, not for latency.
● Blocks are expected to be large; there is a known issue with lots of small files.
● A write once, read many times ideology.
● Append only, no 'edit' ability.
● Special tools such as Apache HBase are required to implement OLTP.
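The small-files issue comes from the namenode keeping all metadata in RAM. A commonly cited rule of thumb (an assumption here, not an exact figure) is roughly 150 bytes of namenode heap per file, directory, or block object:

```python
# Hedged rule of thumb: each file, directory and block object costs
# roughly 150 bytes of namenode heap, and all metadata must fit in RAM.
OBJ_BYTES = 150

def namenode_heap_gb(n_files, blocks_per_file=1):
    # one metadata object per file plus one per block
    objects = n_files * (1 + blocks_per_file)
    return objects * OBJ_BYTES / 2**30

# ~100 TB stored as 100 million 1 MB files (one block each):
print(round(namenode_heap_gb(100_000_000), 1))                  # → 27.9 GB
# the same ~100 TB as 100,000 files of 1 GB in 128 MB blocks:
print(round(namenode_heap_gb(100_000, blocks_per_file=8), 1))   # → 0.1 GB
```

Same data, two orders of magnitude less namenode memory: that is why HDFS wants big blocks and big files.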
MapReduce framework model
● Two-step data processing: transform (map), then reduce. Really convenient to do in a distributed manner.
● A large class of jobs can be adapted, but not all of them.
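The two-step model can be illustrated with the classic word-count example as a minimal plain-Python sketch (this mimics the model, not the Hadoop Java API):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # map: each record independently emits (word, 1) pairs,
    # so this step parallelizes trivially across nodes.
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # shuffle/sort: group identical keys together, then reduce each group.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data", "big clusters", "data data"]
print(dict(reduce_phase(map_phase(lines))))
# → {'big': 2, 'clusters': 1, 'data': 3}
```

In real Hadoop, the map tasks run on the nodes holding the input blocks, and the framework performs the sort/shuffle between the two phases.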
MapReduce service: top view
MapReduce service
● One JobTracker, with redundancy possible.
● Multiple TaskTrackers doing the actual job.
● The ideology is similar to HDFS handling.
● HDFS is usually used as storage in all phases.
Technology: Hadoop 2.0 concept
● A new component (YARN) forms the resource management layer and completes a real distributed data OS.
● MapReduce is from now on only one among other YARN applications.
YARN: notable addition
YARN service
● The resource manager dispatches client requests.
● Node managers manage node resources.
● Any application is a set of containers, including the application master.
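The container model can be sketched as a toy Python allocator (illustrative only; the class and node names are invented, and real YARN schedules by memory and vcores, queues and locality, not simple slots): the first container granted to an application runs its application master, which then requests further containers for workers.

```python
class ToyResourceManager:
    """Toy YARN-like model: grants containers from per-node capacity."""

    def __init__(self, node_capacity):
        self.free = dict(node_capacity)      # node -> free container slots

    def allocate(self, n):
        # Grant up to n containers, draining nodes in order.
        granted = []
        for node in sorted(self.free):
            while self.free[node] > 0 and len(granted) < n:
                self.free[node] -= 1
                granted.append(node)
        return granted

rm = ToyResourceManager({"node1": 2, "node2": 2})
app_master = rm.allocate(1)   # first container hosts the application master
workers = rm.allocate(3)      # the master then asks for worker containers
print(app_master, workers)    # → ['node1'] ['node1', 'node2', 'node2']
```

The point of the model: MapReduce, Spark, Tez and others all reduce to "a master container plus worker containers", which is what lets YARN host them side by side.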
YARN: notable addition
Why is YARN SO important?
● Better resource balance for heterogeneous clusters and multiple applications.
● Dynamic applications over static services.
● A much wider application model than plain MapReduce: things like Spark or Tez.
Hadoop current picture
● HDFS2 is now about storage, and YARN is about processing resources.
● Lots of things can run on top of this data OS, starting with traditional MapReduce. Now there are lots of alternatives.
Just several items around
Infrastructure
● HBase: scalable structured data storage for large tables.
● Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying.
● Mahout: a scalable machine learning and data mining library.
● Pig: a high-level data-flow language and execution framework for parallel computation.
● ZooKeeper: a high-performance distributed coordination service.
Most important concept
The first ever world DATA OS
A 10,000-node computer...
Recent technology changes are focused on higher scale: better resource usage and control, lower MTTR, higher security, redundancy, fault tolerance.
Trends
Big data is going BIGGER
● SSDs are going to be widely used as storage, and memory-based replicas are not a miracle anymore.
● Memory and SSD based caching schemes are going to be more and more aggressive, particularly in HDFS and HBase.
● Clusters grow. Currently some open source features are targeted at clusters of 1K nodes. How about a 300-node staging cluster in companies like eBay?
● Production clusters go beyond 4,000 nodes (up to 10K). Node failure nearly every day.
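"Node failure nearly every day" is just arithmetic. Assuming (an illustrative figure, not a measured one) that a commodity node fails on average once every three years:

```python
# Back-of-envelope: expected node failures per day at cluster scale,
# assuming each node fails on average once every 3 years (illustrative).
nodes = 4000
mean_years_between_failures = 3

failures_per_day = nodes / (mean_years_between_failures * 365)
print(round(failures_per_day, 1))   # → 3.7 expected failures per day
```

At that rate, failure handling cannot be a manual, exceptional procedure; it has to be the system's normal mode of operation.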
Trends
HARDWARE IS GETTING CHEAPER
● A typical node is expected to include at least 64 GB of memory.
● Starting from 4 x 2 TB drives for storage; 8-16 x 4 TB drives are not so rare. This is for a general 'workload' node.
● 10 and more CPU cores; 2 CPUs is the normal approach.
● SSDs are starting to be widely used not only for the OS and caching but for data itself.
● The main outcome: the per-node cost model is changing.
For whom the bell tolls?
Old way
● Make assumptions about the data you need.
● Make assumptions about the data model.
● Make assumptions about the algorithms you need.
● Get confirmation of your initial guess about the result. Are you surprised?
New way
● Get as much data as you can.
● Detect the data model based on a set of algorithms with an extensive approach.
● Cluster your data, detect correlations, clean out anomalies... in every way you can afford, on the whole data set.
● Get grounded results. You can still miss some fundamental aspects, but isn't it much better in any case?
Major Hadoop distributions
● HortonWorks is 'barely open source'. Innovative, but 'running too fast': most of their key technologies are not so mature yet.
● Cloudera is stable enough but not stale. Hadoop 2.3 with YARN, HBase 0.96.x. Balance.
● MapR focuses on performance per node, but they are slightly outdated in terms of functionality, and their distribution costs money. For cases where node performance is a high priority.
● Intel is a newcomer on this market. Not for the near future.