Hello,
A “gentle” introduction to the world of Big Data and
the Hadoop platform
Agenda
1. Introduction
• The history, the #BigData, a bit of theory behind…
2. What is Hadoop, part 1
• Introducing HDFS and Map/Reduce
3. What is Hadoop, part 2
• The next generation (v. 2.x), Real time, …
4. Microsoft and Big Data
• Lambda architecture and Windows Azure, WA Storage(s), WA HDInsight
5. Q&A
Who am I?
(Who cares?)
Stefano Paluello
• Tech Lead @ SG Gaming
• All-around geek, passionate about architecture, Cloud and Data
• Co-founder of various start-up(s)
How it all started…
Oops…
history
• 2002: Hadoop, created by Doug Cutting (as part of the Lucene project), starts as an Open Source search engine for the Web. It has its origins in Apache Nutch, part of the Lucene project (a full text search engine).
• 2003: Google publishes a paper describing its own distributed file system, GFS.
• 2004: The first version of NDFS, the Nutch Distributed FS, implements Google’s paper.
history
• 2004: Google publishes another paper, introducing the MapReduce algorithm.
• 2005: The first version of MapReduce is implemented in Nutch.
• 2005 (end): Nutch’s MapReduce is running on NDFS.
• 2006 (Feb): Nutch’s MapReduce and NDFS become the core of a new Lucene subproject: Hadoop.
history
• 2008: Yahoo launches the World’s largest Hadoop PRODUCTION site.
Some Webmap size data:
• # of links between pages in the index: roughly 1 trillion (10^12) links
• Size of the output: over 300 TB, compressed (!!!)
• # of cores to run a single MapReduce job: over 10,000
• Raw disk used in the production cluster: over 5 Petabytes
OK, let’s start with… a bit of theory
Nooo, wait! Don’t run away
What is #BigData?
BigData is a definition, though for some it is just a buzzword (a keyword with no precise meaning but an interesting sound), trying to address the “new” (really?!?) need to process a lot of data.
We usually use the “Three V’s” to define BigData.
The 3 V’s of #BigData?
Volume: the size of the data that we’re dealing with
Variety: the data is coming from a lot of different sources
Velocity: the speed at which the data is generated
Source: www.wipro.com, July 2012
And the 4Vs of #BigData?
Source: www.wipro.com, July 2012
Source: Oracle.com
the 4Vs of #BigData (2)
Source: IBM.com
#BigData
It is predicted that between 2009 and 2020 the estimated size of the “digital universe” will grow to around 35 Zettabytes (!!!)
1 Zettabyte (2^70 bytes) = 1k Exabyte or 1M Petabyte or 1G Terabyte
Source: www.wipro.com, July 2012
The #BigData market analysis and the 3 V’s definition were introduced by a Gartner research note about 13 years ago:
http://blogs.gartner.com/doug-laney/deja-vvvue-others-claiming-gartners-volume-velocity-variety-construct-for-big-data/
Big Data Lambda Architecture
What??? Lam…what???
I said LAMBDA!!!
Lambda Architecture
Solves the problem of computing arbitrary functions on arbitrary data by decomposing the problem into three layers:
The batch layer
The serving layer
The speed layer
The Batch layer
Stores all the data in an immutable, constantly
growing dataset
Accessing all the data is too expensive (even if
possible)
Precomputed “query” functions are created (aka “batch views”, high latency operations), allowing the results to be accessed quickly
The Batch layer
Source: “Big Data”, by Manning
The Serving layer
Indexes the batch views
Loads the batch views and allows them to be accessed and queried efficiently
It is usually a distributed database that loads the batch views and is updated by the batch layer
It requires batch updates and random reads, but does NOT require random writes.
The Speed layer
Compensates for the high latency of updates to the serving layer
Provides fast incremental algorithms
Updates the realtime views as new data arrive, without recomputing everything from scratch like the batch layer
The Speed layer
Source: “Big Data”, by Manning
Recap
Source: “Big Data”, by Manning
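To make the recap concrete, here is a minimal Python sketch of how a query merges the two views, under the simplifying assumption that the batch and realtime views are plain dictionaries keyed the same way (the key names and numbers are made up):

# Batch view: precomputed, high latency; realtime view: incremental, covers data not yet absorbed by the batch layer
batch_view = {"pageviews/home": 10000, "pageviews/about": 2500}
realtime_view = {"pageviews/home": 42, "pageviews/contact": 3}

def query(key):
    """Answer a query by merging the batch view with the realtime view."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(query("pageviews/home"))     # 10042
print(query("pageviews/contact"))  # 3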
Distributed Data 101
Just a couple of reminders…
ACID
ACID is a set of properties that guarantee that database transactions are processed reliably
[ Source: Wikipedia ]
Atomicity: or “all or nothing”. All the modifications in a transaction must happen successfully, or no changes are committed
Consistency: all my data will always be in a valid state after every transaction
Isolation: transactions are isolated, so any transaction is separated and won’t affect the data of other transactions
Durability: once a transaction is committed, the related data are safely and durably stored, regardless of errors, crashes or any software malfunctions
CAP
The CAP theorem (or Brewer’s theorem) is a set of basic requirements that describes a distributed system
Consistency: all the servers in the system will have the same data
Availability: all the servers in the system will be available and will return all the data they have (even if it may not be consistent across the system)
Partition (tolerance): the system will continue to operate as a whole despite arbitrary message loss or failure of part of the system
According to the theorem, a distributed system CANNOT satisfy all three requirements at the SAME time (the “two out of three” concept).
Here we are…
Your “#BigData 101” degree!
What is Hadoop? (Part 1)
Hadoop…
Where does it come from?
The “legend” says that the name comes from the toy elephant of the son of Doug Cutting (one of the founders of the project); hence the yellow smiling elephant logo.
Hadoop cluster
A Hadoop cluster consists of mainly two modules:
A way to store distributed data, the HDFS or Hadoop Distributed File System (storage layer)
A way to process data, MapReduce (compute layer)
This is the core of Hadoop!
HDFS
The Hadoop Distributed File System
From a developer’s point of view it looks like a standard file system
Runs on top of the OS file system (ext3, …)
Designed to store a very large amount of data (petabytes and beyond) and to solve some of the problems that come with DFS and NFS
Provides fast and scalable access to the data
Stores data reliably
How does this…
HDFS under the hood
All the files loaded into Hadoop are split into chunks, called blocks. Each block has a fixed size of 64 MB (!!!). Yes, Megabytes!
Example: MyData, a 150 MB file, is stored in HDFS as Blk_01 (64 MB), Blk_02 (64 MB) and Blk_03 (22 MB).
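As a tiny illustration of that arithmetic, a minimal Python sketch (the block size and file size are just the example values from the slide):

def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file of the given size is split into."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

print(split_into_blocks(150))  # [64, 64, 22] -> Blk_01, Blk_02, Blk_03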
Datanode(s) and Namenode
The Datanode is a daemon (a service, in Windows terminology) running on each cluster node, responsible for storing the blocks
The Namenode is a dedicated node where the metadata of all the files (blocks) in the system is stored. It’s the directory manager of the HDFS
To access a file, a client contacts the Namenode to retrieve the list of locations of its blocks. With the locations, the client contacts the Datanodes to read the data (possibly in parallel).
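A toy sketch of that read path, with in-memory stand-ins for the Namenode metadata and the Datanodes (this is not the real HDFS client API, just an illustration of the protocol):

# Toy Namenode metadata: file name -> ordered list of (block id, replica locations)
namenode_metadata = {
    "MyData": [("blk_01", ["dn1", "dn2", "dn3"]),
               ("blk_02", ["dn2", "dn3", "dn4"]),
               ("blk_03", ["dn1", "dn3", "dn4"])],
}

# Toy Datanodes: each one stores the content of the blocks it holds
datanodes = {
    "dn1": {"blk_01": b"part1", "blk_03": b"part3"},
    "dn2": {"blk_01": b"part1", "blk_02": b"part2"},
    "dn3": {"blk_01": b"part1", "blk_02": b"part2", "blk_03": b"part3"},
    "dn4": {"blk_02": b"part2", "blk_03": b"part3"},
}

def read_file(name):
    """1. Ask the Namenode for the block locations; 2. read each block from a Datanode."""
    data = b""
    for block_id, locations in namenode_metadata[name]:
        host = locations[0]                # pick one replica (could be the closest one)
        data += datanodes[host][block_id]  # read the block content from that Datanode
    return data

print(read_file("MyData"))  # b"part1part2part3"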
Data Redundancy
Hadoop replicates each block THREE times as it’s stored in the HDFS.
The location of every block is managed by the Namenode
If a block is under-replicated (due to some failure on a node), the Namenode is smart enough to create another replica, until each block has three replicas inside the cluster
Yes… you did your homework! If I have 100 TB of data to store in Hadoop, I will need 300 TB of storage space.
Datanode(s) and Namenode
[Diagram: one Namenode (NN) coordinating several Datanodes (D)]
Namenode availability
If the Namenode fails, ALL the cluster becomes inaccessible
In the early versions the Namenode was a single point of failure
A couple of solutions are now available:
the Namenode stores its data on the network through NFS
most production sites have two Namenodes: Active and Standby
HDFS Quick Reference
The HDFS commands are pretty easy to use and to remember (especially if you come from a *nix-like environment)
The commands usually have the “hadoop fs” prefix
To list the content of an HDFS folder
> hadoop fs -ls
To load a file into HDFS
> hadoop fs -put <file>
To read a file loaded into HDFS
> hadoop fs -tail <file>
And so on…
> hadoop fs -mkdir <dir>
> hadoop fs -mv <sourcefile> <destfile>
> hadoop fs -rm <file>
MapReduce
MapReduce
Processing large files serially could be a problem. MapReduce is designed to be a highly parallelized way of managing data
Data are split into many pieces
Each piece is processed simultaneously and in isolation
Data are processed in isolation by tasks called Mappers. The result of the Mappers is then brought together (with a process called “Shuffle and Sort”) into a second set of tasks, the Reducers.
Mappers
Reducers
The MapReduce “HelloWorld”
All MapReduce examples and tutorials start with one simple example: WordCount. Let’s take a look at it.
Java code…
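The original slide shows the Java listing; as a compact stand-in, here is a minimal WordCount written streaming-style in Python, with a tiny local driver that emulates the map → shuffle/sort → reduce flow (the sample text is made up):

from collections import defaultdict

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: sum the counts per word (pairs must arrive grouped by key)."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

if __name__ == "__main__":
    sample = ["Hello Big Data", "hello Hadoop", "Big Data and Hadoop"]
    shuffled = sorted(mapper(sample))  # “Shuffle and Sort” is just a group-by-key here
    print(reducer(shuffled))  # {'and': 1, 'big': 2, 'data': 2, 'hadoop': 2, 'hello': 2}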
Using Hadoop Streaming
Hadoop Streaming allows you to write Mappers and Reducers in almost any language, rather than forcing you to use Java
The command to run a streaming job is a bit “tricky”
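For reference, a typical invocation looks roughly like this (the jar path, directories and script names are placeholders and vary by distribution):
> hadoop jar /path/to/hadoop-streaming.jar -input <inputdir> -output <outputdir> -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py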
MapReduce on a “real” case
Retailer with many stores around the country
The data are written to a sequential log with date, store location, item, price, payment
2014-01-01  London     Clothes  13.99£  Card
2014-01-01  NewCastle  Music    05.69£  Bank
….
A really simple mapper will split each record and emit the store location with the sale amount
The reducer will then calculate the Sales Total for every location
Python code…
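Along the lines of the missing listing, a minimal sketch of such a mapper and reducer in Python (the tab-separated field layout and the local test driver are assumptions based on the sample records above):

from collections import defaultdict

def map_sales(lines):
    """Emit (store, amount) for every record: date, store, item, price, payment."""
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) == 5:
            date, store, item, price, payment = fields
            yield store, float(price.rstrip("£"))

def reduce_sales(pairs):
    """Sum the sale amounts per store location."""
    totals = defaultdict(float)
    for store, amount in pairs:
        totals[store] += amount
    return dict(totals)

if __name__ == "__main__":
    sample = ["2014-01-01\tLondon\tClothes\t13.99£\tCard",
              "2014-01-01\tNewCastle\tMusic\t05.69£\tBank"]
    print(reduce_sales(map_sales(sample)))  # {'London': 13.99, 'NewCastle': 5.69}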
How MapReduce works…
… and the Streaming
Hadoop related projects
PIG: a high level language for analyzing large data-sets. It works as a compiler that produces M/R jobs
HIVE: data warehouse software that facilitates querying and managing large data-sets with a SQL-like language
HBase: a scalable, distributed database that supports structured data storage for large tables
Cassandra: a scalable multi-master database
What is Hadoop? (Part 2)
Hadoop v 2.x
Hadoop is a pretty easy system to use, but a bit tricky to set up and manage
The skills required are more related to System Management than to the Dev side
Let’s add that the Apache documentation has never stood out for clarity and completeness
So, to add a bit of mess, they decided to make v2, which actually changes a lot
Hadoop v 2.x
The new Hadoop now has FOUR modules (instead of two):
Hadoop Common: common utilities supporting all the other modules
HDFS: an evolution of the previous distributed FS
Hadoop YARN: a framework for job scheduling and cluster resource management
Hadoop MapReduce: a YARN-based system for parallel processing of large data sets
Hadoop v 2.x
Hadoop v2, leveraging YARN, is aiming to become the new OS for data processing
Hadoop and real time
Hadoop v2, using YARN, plus Storm (a free and open source distributed real time computation system) can compute your data in real time
Some Hadoop distributions (like Hortonworks) are working on an effortless integration
http://hortonworks.com/blog/stream-processing-in-hadoop-yarn-storm-and-the-hortonworks-data-platform/
Microsoft Azure and Hadoop
Microsoft Lambda Architecture support
Batch Layer
• WA HDInsight
• WA Blob storage
• MapReduce, Hive, Pig, …
Speed Layer
• Federation in WA SQL DB
• Azure Tables
• Memcached / MongoDB
• SQL Azure
• Reactive Extensions (Rx)
Serving Layer
• Azure Storage Explorer
• MS Excel (and Office suite)
• Reporting Services
• Linq To Hive
• Analysis Services
Yahoo, Hadoop and SQL Server
[Diagram] Hadoop data flows from Apache Hadoop, through the SQL Server Connector (Hadoop Hive ODBC), into a Staging Database and a SQL Server Analysis Services (SSAS) cube, and is consumed by Microsoft Excel and PowerPivot, other BI tools and custom applications, and third-party databases plus custom applications.
MS .NET SDK for Hadoop
• .NET client libraries for Hadoop
• Write MapReduce in Visual Studio using C# or F#
• Debug against local data
WebClient Libraries in .NET
• WebHDFS client library: works with files in HDFS and Windows Azure Blob storage
• WebHCat client library: manages the scheduling and execution of jobs in an HDInsight cluster
WebHDFS: a scalable REST API to move files in and out of HDFS, delete them, and perform file and directory functions
WebHCat: HDInsight job scheduling and execution
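To give a feel for the REST API these client libraries wrap, a hedged Python sketch that lists an HDFS directory through the standard WebHDFS endpoint (host, port, path and user name are placeholders):

import json
import urllib.request

NAMENODE = "http://namenode.example.com:50070"  # placeholder host and HTTP port
PATH = "/user/demo"                             # placeholder HDFS directory
USER = "demo"                                   # placeholder user name

# WebHDFS exposes file system operations as REST calls, e.g. LISTSTATUS lists a directory
url = NAMENODE + "/webhdfs/v1" + PATH + "?op=LISTSTATUS&user.name=" + USER
with urllib.request.urlopen(url) as response:
    statuses = json.load(response)["FileStatuses"]["FileStatus"]

for entry in statuses:
    print(entry["type"], entry["pathSuffix"], entry["length"])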
Reactive Extensions (Rx): Pulling vs. Pushing Data
Interactive vs Reactive
• In interactive programming, you pull data from a sequence that represents the source (IEnumerator)
• In reactive programming, you subscribe to a data stream (called an observable sequence in Rx), with updates handed to you from the source
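Not Rx itself, but a minimal Python sketch of the same pull vs push distinction (the Observable class is a made-up stand-in, not an Rx API):

# Pull (interactive): the consumer asks the source for the next item
numbers = iter([1, 2, 3])
print(next(numbers))  # the application asks: “got next?”

# Push (reactive): the source hands items to whoever subscribed
class Observable:
    def __init__(self):
        self.observers = []
    def subscribe(self, on_next):
        self.observers.append(on_next)
    def emit(self, value):
        for on_next in self.observers:  # the source tells: “have next!”
            on_next(value)

source = Observable()
source.subscribe(lambda value: print("received", value))
source.emit(42)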
Reactive Extensions (Rx): Pulling vs. Pushing Data
[Diagram] Interactive (pull): the Application asks “Got next?” by calling MoveNext on IEnumerable<T> / IEnumerator<T>
[Diagram] Reactive (push): the source answers “Have next!” by calling OnNext on IObservable<T> / IObserver<T>
Editor’s notes
1. Examples of data to store: transactions (financial, government related), logs (records of activity, location), business data (product catalogs, prices, customers), user data (images, documents, video), sensor data (temperature, pollution), medical data (x-rays, brain activity records), social (email, twitter, etc.)
2. Lambda architecture: a “community” driven architecture providing a way for different BigData components to work together
3. The batch layer runs in a while(true) loop and recomputes the batch views from scratch. It’s quite simple to implement
4. The speed layer maintains the same keys as the batch layer, so it is able to recognize and select the same data. The difference is that this layer modifies the views as it receives the data
5. RECAP… Usually the Batch Layer is implemented with HDFS (Hadoop Distributed File System); Serving Database: ElephantDB, HBase…; Speed Layer: Cassandra (a map with a sorted map as a value), or Cassandra with Storm (stream access), or an in-memory DB
6. RECAP…
7. Example: in the cloud, on an elastic first-level system, the service should be “stateless” or at least “soft-state” (cached) and must always respond to the query, even if the backend is down. So the system will be “A”, immediately responsive, and “P”: regardless of a failure in the backend, the system keeps responding to the requests
8. Using SQL Server 2008 R2, Yahoo! enhanced its Targeting, Analytics and Optimization (TAO) infrastructure. Key points: with Big Data technology, Yahoo experienced the following benefits: improved ad campaign effectiveness and increased advertiser spending; a cube producing 24 terabytes of data quarterly, making it the world’s largest SQL Server Analysis Services cube; the ability to handle more than 3.5 billion daily ad impressions, with hourly refresh rates. References: Microsoft case study, “Yahoo! Improves Campaign Effectiveness, Boosts Ad Revenue with Big Data Solution”: http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=710000001707