2. Agenda
1. Introduction
• The history, the #BigData, a bit of theory behind…
2. What is Hadoop, part 1
• Introducing HDFS and Map/Reduce
3. What is Hadoop, part 2
• The next generation (v. 2.x), real time, …
4. Microsoft and Big Data
• Lambda architecture and Windows Azure, WA Storage(s), WA HDInsight
5. Q&A
3. Who am I?
(Who cares?)
Stefano Paluello
• Tech Lead @ SG Gaming
• All-around geek, passionate about architecture, Cloud and Data
• Co-founder of various start-up(s)
7. History
• 2002
Hadoop, created by Doug Cutting, starts as an open source search engine for the Web. It has its origins in Apache Nutch, part of the Lucene project (a full-text search engine).
• 2003
Google publishes a paper describing its own distributed file system, GFS.
• 2004
The first version of NDFS, the Nutch Distributed FS, implements Google's paper.
8. History
• 2004
Google publishes another paper, introducing the MapReduce algorithm
• 2005
The first version of MapReduce is implemented in Nutch
• 2005 (end)
Nutch's MapReduce is running on NDFS
• 2006 (Feb)
Nutch's MapReduce and NDFS become the core of a new Lucene subproject: Hadoop
9. history
• 2008
Yahoo launches the World’s largest
Hadoop PRODUCTION site
Some Webmap size data:
• # of links between pages in the index:
roughly 1 trillion (10^12) links
• Size of the output:
over 300 TB, compressed (!!!)
• # of cores to run a single MapReduce job:
over 10000
• Raw disk used in the production cluster:
Over 5 Petabytes
14. What is #BigData?
BigData is a definition, but for some it is just a buzzword (a keyword with no precise meaning, but sounding interesting) trying to address this "new" (really?!?) need to process a lot of data.
We usually use the "Three V's" to define BigData.
15. The 3 V’s of #BigData?
Volume: the size of the data that we’re dealing with
Variety: the data is coming from a lot of different
sources
Velocity: the speed at which the data is generated
19. #BigData
It is predicted that between 2009 and 2020 the estimated size of the "digital universe" will grow to around 35 Zettabytes (!!!)
1 Zettabyte = 2^70 bytes = 1K Exabytes or 1M Petabytes or 1G Terabytes
Source: www.wipro.com, July 2012
The #BigData market analysis and the 3Vs definition were introduced by Gartner research about 13 years ago
http://blogs.gartner.com/doug-laney/deja-vvvue-others-claiming-gartners-volume-velocity-variety-construct-for-big-data/
23. Lambda Architecture
Solves the problem of computing arbitrary functions on arbitrary data by decomposing the problem into three layers:
The batch layer
The serving layer
The speed layer
24. The Batch layer
Stores all the data in an immutable, constantly
growing dataset
Accessing all the data is too expensive (even if
possible)
Precomputed "query" functions (aka "batch views", high-latency operations) are created, allowing the results to be accessed quickly
26. The Serving layer
Indexes the batch views
Loads the batch views and allows accessing and querying them efficiently
It is usually a distributed database that loads the batch views and is updated by the batch layer
It requires batch updates and random reads, but does NOT require random writes.
27. The Speed layer
Compensates for the high-latency updates of the serving layer
Provides fast incremental algorithms
Updates the realtime views as new data arrives, without recomputing everything like the batch layer does
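The split between the layers above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration (the names `batch_view`, `realtime_view` and `query_page_views` are invented, not part of any Lambda Architecture library): a query merges the complete but stale batch view with the small, fresh realtime view.

```python
# Hypothetical sketch of a Lambda Architecture query merging the two views.

def query_page_views(url, batch_view, realtime_view):
    """Answer 'total page views for url' by combining the precomputed
    batch view (complete, but hours old) with the realtime view
    (covering only the data that arrived since the last batch run)."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

# The batch layer precomputed this from the full immutable dataset:
batch_view = {"/home": 10_000, "/about": 250}
# The speed layer incremented this while new events streamed in:
realtime_view = {"/home": 42}

total = query_page_views("/home", batch_view, realtime_view)  # 10042
```

When the next batch run finishes, its view absorbs the data the speed layer was covering, and the realtime view can be discarded and rebuilt.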
32. ACID
ACID is a set of properties that guarantee that database transactions are processed reliably
[ Source: Wikipedia ]
Atomicity: "all or nothing". All the modifications in a transaction must happen successfully, or no changes are committed
Consistency: the data will always be in a valid state after every transaction
Isolation: transactions are isolated, so any transaction is separated and won't affect the data of other transactions
Durability: once a transaction is committed, the related data is safely and durably stored, regardless of errors, crashes or any software malfunctions
33. CAP
CAP theorem (or Brewer's theorem) describes a set of basic requirements for a distributed system
Consistency: all the servers in the system will have the same data
Availability: all the servers in the system will be available and will return the data they hold (even if it may not be consistent across the system)
Partition (tolerance): the system continues to operate as a whole despite arbitrary message loss or failure of part of the system
According to the theorem, a distributed system CANNOT satisfy all three requirements at the SAME time (the "two out of three" concept).
36. Hadoop…
Where does it come from?
The "legend" says that the name comes from the toy elephant of the son of Doug Cutting (one of the founders of the project), which is also why the logo is a yellow smiling elephant.
37. Hadoop cluster
A Hadoop cluster consists mainly of two modules:
A way to store distributed data, the HDFS or
Hadoop Distributed File System (storage layer)
A way to process data, the MapReduce (compute
layer)
This is the core of Hadoop!
38. HDFS
The Hadoop Distributed File System
From a developer's point of view it looks like a standard file system
Runs on top of the OS file system (ext3, …)
Designed to store a very large amount of data (petabytes and beyond) and to solve some of the problems that come with DFS and NFS
Provides fast and scalable access to the data
Stores data reliably
41. HDFS under the hood
All the files loaded into Hadoop are split into chunks, called blocks. Each block has a default size of 64MB (!!!). Yes, Megabytes!
MyData (150MB) → HDFS: Blk_01 (64MB) + Blk_02 (64MB) + Blk_03 (22MB)
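The splitting rule above is simple enough to sketch in Python. This is an illustration only (`split_into_blocks` is an invented name, not a Hadoop API): a file is cut into full-size blocks, with whatever remains going into a final, smaller block.

```python
# Illustrative sketch of how HDFS cuts a file into fixed-size blocks.
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, the default block size

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return the size of each block a file of the given size occupies."""
    sizes = []
    remaining = file_size_bytes
    while remaining > 0:
        sizes.append(min(block_size, remaining))
        remaining -= block_size
    return sizes

# A 150MB file becomes two full 64MB blocks plus one 22MB block:
mb = 1024 * 1024
blocks = split_into_blocks(150 * mb)  # [64MB, 64MB, 22MB]
```

Note that the last block only occupies its actual 22MB on disk; HDFS does not pad it out to 64MB.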
42. Datanode(s) and Namenode
Datanode is a daemon (a service in the Windows language)
running on each cluster nodes, that is responsible to store
the blocks
Namenode, is a dedicated node where all the metadata of all
the files (blocks) inside my system are stored. It’s the
directory manager of the HDFS
To access a file, a client contact the Namenode to retrieve
the list of locations for the blocks. With the locations the
client contact the Datanodes to read the data (possibly in
parallel).
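The two-step read path above can be mimicked with a toy data structure. Everything here is invented for illustration (the metadata layout and `read_file` are not real Hadoop APIs): the "Namenode" is a dict mapping file names to block locations, and the "client" picks a Datanode for each block.

```python
# Toy sketch of the HDFS read path: ask the Namenode for block
# locations, then fetch each block from one of its Datanodes.

# Namenode metadata: file name -> list of (block id, [datanode hosts])
namenode_metadata = {
    "MyData": [("blk_01", ["node1", "node3"]),
               ("blk_02", ["node2", "node1"]),
               ("blk_03", ["node3", "node2"])],
}

def read_file(name):
    # 1. The client contacts the Namenode for the block locations.
    blocks = namenode_metadata[name]
    # 2. With the locations, it contacts a Datanode per block
    #    (here simply the first replica; a real client picks the
    #    closest one and can read blocks in parallel).
    return [(block_id, hosts[0]) for block_id, hosts in blocks]
```

The key point the sketch preserves: the data itself never flows through the Namenode, only the metadata does.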
43. Data Redundancy
Hadoop replicates each block THREE times as it is stored in the HDFS.
The location of every block is managed by the Namenode
If a block is under-replicated (due to some failure on a node), the Namenode is smart enough to create another replica, until each block has three replicas inside the cluster
Yes… you did your homework! If I have 100TB of data to store in Hadoop, I will need 300TB of storage space.
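The under-replication check described above reduces to a simple comparison per block. The sketch below is hypothetical (the real Namenode logic is far more involved, and `blocks_to_re_replicate` is an invented name): given the live replicas of each block, it reports how many extra copies are needed.

```python
# Illustrative sketch of the Namenode's re-replication bookkeeping.
REPLICATION_FACTOR = 3

def blocks_to_re_replicate(block_replicas):
    """Given a map of block id -> list of live Datanodes holding it,
    return how many extra copies each under-replicated block needs."""
    return {blk: REPLICATION_FACTOR - len(nodes)
            for blk, nodes in block_replicas.items()
            if len(nodes) < REPLICATION_FACTOR}

# After node n2 fails, blk_02 is left with only two live replicas:
replicas = {"blk_01": ["n1", "n2", "n3"], "blk_02": ["n1", "n3"]}
needed = blocks_to_re_replicate(replicas)  # {"blk_02": 1}
```

The Namenode would then schedule one more copy of blk_02 on a healthy node, restoring the replication factor of three.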
45. Namenode availability
If the Namenode fails, the WHOLE cluster becomes inaccessible
In the early versions the Namenode was a single point of failure
A couple of solutions are now available:
the Namenode stores its data on the network through NFS
most production sites have two Namenodes: Active and Standby
46. HDFS Quick Reference
The HDFS commands are pretty easy to use and to remember (especially if you come from a *nix-like environment)
The commands usually have the "hadoop fs" prefix
To list the content of an HDFS folder
> hadoop fs -ls
To load a file into the HDFS
> hadoop fs -put <file>
To read a file loaded into HDFS
> hadoop fs -tail <file>
And so on…
> hadoop fs -mkdir <dir>
> hadoop fs -mv <sourcefile> <destfile>
> hadoop fs -rm <file>
49. MapReduce
Processing large files serially can be a problem. MapReduce is designed as a highly parallel way of managing data
Data is split into many pieces
Each piece is processed simultaneously and in isolation
Data is processed in isolation by tasks called Mappers. The results of the Mappers are then brought together (by a process called "Shuffle and Sort") into a second set of tasks, the Reducers.
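The Map → Shuffle and Sort → Reduce flow above can be simulated in plain Python without a cluster. This is a minimal word-count sketch (the function names are our own; a real Hadoop job would implement Mapper/Reducer classes in Java instead):

```python
# Word count, simulating the three MapReduce phases in-process.
from collections import defaultdict

def mapper(line):
    # Each Mapper processes its piece in isolation,
    # emitting a (word, 1) pair for every word.
    return [(word, 1) for word in line.split()]

def shuffle_and_sort(mapped_pairs):
    # The framework groups all emitted values by key
    # between the two phases.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return dict(groups)

def reducer(key, values):
    return (key, sum(values))

lines = ["big data big", "data"]
mapped = [pair for line in lines for pair in mapper(line)]
grouped = shuffle_and_sort(mapped)
result = dict(reducer(k, v) for k, v in grouped.items())
# result == {"big": 2, "data": 2}
```

Because each mapper call is independent, Hadoop can run thousands of them at once across the cluster; only the shuffle forces data movement between nodes.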
55. Using Hadoop Streaming
Hadoop Streaming allows you to write Mappers and Reducers in almost any language, rather than forcing you to use Java
The command to run a streaming job is a bit "tricky"
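In a streaming job, the mapper and reducer are plain executables that talk to the framework over stdin/stdout with tab-separated key/value lines. The sketch below is an assumed example, not taken from the slides (the function names are ours); note that the reducer receives its input already sorted by key, so it only needs to sum consecutive counts:

```python
# Sketch of a Hadoop Streaming mapper/reducer pair, exchanging
# tab-separated "key\tvalue" lines as the framework expects.
def streaming_mapper(lines):
    """Emit 'word\t1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def streaming_reducer(sorted_lines):
    """Input arrives sorted by key; sum consecutive counts per key."""
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

mapped = sorted(streaming_mapper(["big data big"]))
result = list(streaming_reducer(mapped))  # ["big\t2", "data\t1"]
```

On a real cluster these would live in two scripts passed to the streaming jar via its -mapper and -reducer options, with the sort between them done by the framework rather than by `sorted()`.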
57. MapReduce on a “real” case
Retailer with many stores around the country
The data are written on a sequential log with date,
store location, item, price, payment
Date        Location   Item     Price   Payment
2014-01-01  London     Clothes  13.99£  Card
2014-01-01  NewCastle  Music    05.69£  Bank
…
A really simple mapper will split each record and pass it on, keyed by location, to a reducer
The reducer will calculate the Sales Total for every location
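The retail case above fits MapReduce naturally. The sketch below is illustrative (field layout follows the sample log, but with tab-separated fields and prices stored as plain numbers for simplicity; the function names are invented): the mapper emits (location, amount) pairs and the reducer sums them per location.

```python
# Sales-total-per-location sketch for the retailer log above.
from collections import defaultdict

def sales_mapper(line):
    # Log record: date, store location, item, price, payment method.
    date, location, item, price, payment = line.split("\t")
    return (location, float(price))

def sales_reducer(pairs):
    # Sum the sale amounts per store location.
    totals = defaultdict(float)
    for location, amount in pairs:
        totals[location] += amount
    return dict(totals)

log = ["2014-01-01\tLondon\tClothes\t13.99\tCard",
       "2014-01-01\tNewCastle\tMusic\t5.69\tBank"]
totals = sales_reducer(sales_mapper(line) for line in log)
# totals == {"London": 13.99, "NewCastle": 5.69}
```

Because records for different locations are independent, the log can be split across many mappers and each location's total computed by a separate reducer.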
61. Hadoop related projects
PIG: a high-level language for analyzing large data-sets. It works as a compiler that produces M/R jobs
HIVE: data warehouse software facilitating querying and managing large data-sets with a SQL-like language
HBase: a scalable, distributed database that supports structured data storage for large tables
Cassandra: a scalable multi-master database
63. Hadoop v 2.x
Hadoop is a pretty easy system to use, but a bit tricky to set up and manage
The skills required are more related to System Management than to the Dev side
Add to that the fact that the Apache documentation has never stood out for clarity and completeness
So, to add a bit of mess, they decided to make v2, which actually changes a lot
64. Hadoop v 2.x
The new Hadoop now has FOUR modules (instead of two)
Hadoop Common: common utilities supporting all the other modules
HDFS: an evolution of the previous distributed FS
Hadoop YARN: a framework for job scheduling and cluster resource management
Hadoop MapReduce: a YARN-based system for parallel processing of large data sets
65. Hadoop v 2.x
Hadoop v2, leveraging YARN, aims to become the new OS for data processing
66. Hadoop and real time
Hadoop v2, using YARN together with Storm (a free and open source distributed real-time computation system), can compute your data in real time
Some Hadoop distributions (like Hortonworks) are working on an effortless integration
http://hortonworks.com/blog/stream-processing-in-hadoop-yarn-storm-and-the-hortonworks-data-platform/
68. Microsoft Lambda Architecture support
Batch Layer
• WA HDInsight
• WA Blob storage
• MapReduce, Hive, Pig, …
Speed Layer
• Federation in WA SQL DB
• Azure Tables
• Memcached/MongoDB
• SQL Azure
• Reactive Extensions (Rx)
Serving Layer
• Azure Storage Explorer
• MS Excel (and Office suite)
• Reporting Services
• Linq To Hive
• Analysis Services
69. Yahoo, Hadoop and SQL Server
(Architecture diagram)
• Apache Hadoop (Hadoop Data)
• SQL Server Connector (Hadoop Hive ODBC)
• Staging Database (SQL Server)
• SQL Server Analysis Services (SSAS Cube)
• Microsoft Excel and PowerPivot
• Other BI tools and custom applications (third-party database + custom applications)
70. MS .NET SDK for Hadoop
• .NET client libraries for Hadoop
• Write MapReduce jobs in Visual Studio using C# or F#
• Debug against local data
71. WebClient Libraries in .NET
Both are scalable REST APIs:
• WebHDFS client library: works with files in HDFS and Windows Azure Blob storage (move files in and out, delete them from HDFS, perform file and directory functions)
• WebHCat client library: manages the scheduling and execution of jobs in an HDInsight cluster
72. Reactive Extensions (Rx): Pulling vs. Pushing Data
Interactive vs Reactive
• In interactive programming, you pull data from a sequence that represents the source (IEnumerator)
• In reactive programming, you subscribe to a data stream (called an observable sequence in Rx), with updates handed to it from the source
73. Reactive Extensions (Rx): Pulling vs. Pushing Data
• Interactive (pull): the application asks the source "Got next?" by calling MoveNext on IEnumerable<T>/IEnumerator<T>, and the source answers "Have next!"
• Reactive (push): the source pushes each new value to the application by calling OnNext on IObservable<T>/IObserver<T>
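The pull/push contrast above can be illustrated outside .NET too. The sketch below uses Python (Rx itself is a .NET library; the toy `Observable` class here only mirrors the shape of its interfaces and is entirely invented): pulling iterates a sequence, pushing registers a callback that the source invokes.

```python
# Pull vs push, in miniature.

# Interactive / pull: the application drives, asking for the next value.
numbers = iter([1, 2, 3])
pulled = [n for n in numbers]       # the loop calls next() under the hood

# Reactive / push: the source drives, handing values to subscribers.
class Observable:
    def __init__(self):
        self._observers = []
    def subscribe(self, on_next):
        self._observers.append(on_next)
    def emit(self, value):          # the source pushes: "have next!"
        for on_next in self._observers:
            on_next(value)

pushed = []
source = Observable()
source.subscribe(pushed.append)
for n in [1, 2, 3]:
    source.emit(n)
# pulled == pushed == [1, 2, 3]
```

The data that arrives is the same; what changes is who controls the pace, the consumer (pull) or the producer (push), which is what makes the reactive style a fit for streams of events.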
Editor's notes
Examples of data to store: transactions (financial, government related), logs (records of activity, location), business data (product catalogs, prices, customers), user data (images, documents, video), sensor data (temperature, pollution), medical data (x-rays, brain activity records), social (email, Twitter etc.)
Lambda architecture: “community” driven architecture providing a way for different BigData components to work together
The batch layer runs in a while(true) loop and recomputes the batch views from scratch. It's quite simple to implement.
The speed layer maintains the same keys as the batch layer, so it is able to recognize and select the same data. The difference is that this layer modifies its views while receiving the data.
RECAP… Usually the Batch Layer is implemented with HDFS (Hadoop Distributed File System). Serving database: ElephantDB, HBase… Speed Layer: Cassandra (a map with a sorted map as value), or Cassandra with Storm (stream access), or an in-memory DB.
Example: in the cloud, on an elastic first-level system, the service should be "stateless" or at least "soft-state" (cached) and must always respond to the query, even if the backend is down. So the system is "A" (immediately responsive) and "P" (regardless of a failure in the backend, the system keeps responding to requests).
Using SQL Server 2008 R2, Yahoo! enhanced its Targeting, Analytics and Optimization (TAO) infrastructure. Key points: with Big Data technology, Yahoo experienced the following benefits: improved ad campaign effectiveness and increased advertiser spending; a cube producing 24 terabytes of data quarterly, making it the world's largest SQL Server Analysis Services cube; the ability to handle more than 3.5 billion daily ad impressions, with hourly refresh rates. References: Microsoft case study, "Yahoo! Improves Campaign Effectiveness, Boosts Ad Revenue with Big Data Solution": http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=710000001707