Federico Cargnelutti / BSkyB
Hadoop & Distributed Computing

Distributed computing uses software to divide pieces of a program among several computers. One project in particular has proven that the concept works extremely well.

SETI@Home
Search for Extra-Terrestrial Intelligence
• Prove the viability of the distributed grid computing concept (succeeded)
• Detect intelligent life outside Earth (failed)

Distributed Computing
What problem are we trying to solve?

Counts of all the distinct words
• in a file?
• in a directory?
• on the Web?

We need to process 100TB datasets
• On 1 node:
  o Scanning @ 50MB/s = 23 days
• On a 1000-node cluster:
  o Scanning @ 50MB/s per node = 33 min

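
The two figures on this slide follow from simple division. A minimal sketch of the arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), a perfectly even split of the data across nodes, and no coordination overhead. The function name `scan_seconds` is illustrative only.

```python
TB = 10**12  # assumption: decimal terabytes
MB = 10**6   # assumption: decimal megabytes


def scan_seconds(dataset_bytes, rate_bytes_per_s, nodes=1):
    """Time to scan a dataset when every node reads an equal share at the given rate."""
    return dataset_bytes / (rate_bytes_per_s * nodes)


single = scan_seconds(100 * TB, 50 * MB)
cluster = scan_seconds(100 * TB, 50 * MB, nodes=1000)

print(f"1 node:     {single / 86400:.1f} days")     # 23.1 days
print(f"1000 nodes: {cluster / 60:.1f} minutes")    # 33.3 minutes
```

The speed-up is linear only under these idealized assumptions; a real cluster also pays for scheduling, data transfer, and stragglers.
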
We need a framework for distribution

We need a new paradigm

Hadoop is an open-source Java framework for running applications on large clusters of commodity hardware

Scalable: Hadoop can reliably store and process petabytes of data.
Economical: Hadoop distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
Efficient: Hadoop can process the distributed data in parallel on the nodes where the data is located.
Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.

Hadoop Components
Hadoop Distributed File System (HDFS)
• Java, Shell, C and HTTP APIs
Hadoop MapReduce
• Java and Streaming APIs
Hadoop on Demand
• Tools to manage dynamic setup and teardown of Hadoop nodes

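
The Streaming API mentioned above lets a mapper and a reducer be ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output, with the reducer receiving its input sorted by key. The sketch below imitates that contract locally for word counting. The function names and the `run_local` helper are hypothetical stand-ins for the framework, not part of Hadoop.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input is sorted by key, so equal words sit next to each other."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Hypothetical local stand-in for the framework: map, sort by key, reduce."""
    return list(reduce_lines(sorted(map_lines(lines))))


result = run_local(["in elit in", "sed in"])
print(result)  # ['elit\t1', 'in\t3', 'sed\t1']
```

In an actual Streaming job the two functions would live in separate scripts that read `sys.stdin`, and the sort between them would be done by the framework.
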
Other Tools
HBase: Table storage on top of HDFS, modeled after Google's Bigtable
Pig: Language for dataflow programming
Hive: SQL interface to structured data stored in HDFS

Hadoop MapReduce
• Mappers and Reducers are allocated
• Code is shipped to nodes
• Mappers and Reducers are run on the same machines as DataNodes
• Two major daemons: JobTracker and TaskTracker

Hadoop MapReduce
JobTracker
• Long-lived master daemon which distributes tasks
• Maintains a history of job execution statistics
TaskTrackers
• Long-lived client daemons which execute Map and Reduce tasks

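
The division of labour between the two daemons can be illustrated with a toy model: a master that hands tasks to workers, records each outcome, and reschedules a task on another worker when one fails. This is a simplified sketch, not Hadoop's implementation. The classes below only borrow the daemon names, and the round-robin assignment and the `healthy` flag are assumptions made for the example. It also ignores the ordering of map and reduce phases.

```python
from collections import deque


class TaskTracker:
    """Hypothetical worker: runs a task, or raises if it is marked unhealthy."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def run(self, task):
        if not self.healthy:
            raise RuntimeError(f"{self.name} cannot run {task}")
        return f"{task} finished on {self.name}"


class JobTracker:
    """Hypothetical master: distributes tasks and keeps a history of outcomes."""

    def __init__(self, trackers):
        self.trackers = trackers
        self.history = []  # entries are (task, tracker name, outcome)

    def run_job(self, tasks):
        pending = deque((i, task, 0) for i, task in enumerate(tasks))
        while pending:
            index, task, attempt = pending.popleft()
            if attempt == len(self.trackers):
                raise RuntimeError(f"{task} failed on every tracker")
            # Round-robin choice; each retry moves on to the next tracker.
            tracker = self.trackers[(index + attempt) % len(self.trackers)]
            try:
                tracker.run(task)
                self.history.append((task, tracker.name, "ok"))
            except RuntimeError:
                self.history.append((task, tracker.name, "failed"))
                pending.append((index, task, attempt + 1))
        return self.history


trackers = [
    TaskTracker("tracker-a"),
    TaskTracker("tracker-b", healthy=False),
    TaskTracker("tracker-c"),
]
history = JobTracker(trackers).run_job(["map-0", "map-1", "reduce-0"])
for entry in history:
    print(entry)
```

Here `map-1` first lands on the unhealthy `tracker-b`, is recorded as failed, and is then completed by `tracker-c`.
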
Hadoop MapReduce
• Set up a multi-node Hadoop cluster using the Hadoop Distributed File System (HDFS)
• Create a hierarchical HDFS with directories and files.
• Use the Hadoop API to store a large text file.
• Create a MapReduce application.

Map
• Mapper takes an input key/value pair
• Does something to its input
• Emits intermediate key/value pairs
• One call per input record
• Fully data-parallel

Map
(in, 1) (in, 1) (sunt, 1) (in, 1) (elit, 1) (sed, 1) (eiusmod, 1)

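
A word-count mapper that produces pairs like the ones on this slide might look as follows. This is a minimal sketch in plain Python rather than the Hadoop Java API. The function name `map_words` and the input record are assumptions chosen so that the output matches the slide.

```python
def map_words(key, value):
    """Mapper for word count: key is the record's offset (unused), value is its text."""
    for word in value.split():
        yield (word, 1)


# Hypothetical input record: (offset, line of text).
record = (0, "in in sunt in elit sed eiusmod")
pairs = list(map_words(*record))
print(pairs)
```

The mapper keeps no state between calls, which is what makes the map phase fully data-parallel: each record can be processed on whichever node holds it.
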
Reduce
• Input is the list of all intermediate values for a given key
• Reducer aggregates the list of intermediate values
• Returns a final key/value pair for output

Reduce
(irure, 1) (in, 3) (ea, 1) (enim, 1) (eu, 1) (Duis, 1) (dolore, 2)

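
The totals on this slide come from grouping the intermediate pairs by key and summing each group. A minimal sketch in plain Python, not the Hadoop Java API: the `shuffle` helper stands in for the grouping the framework performs between the two phases, and the intermediate pairs are a hypothetical input chosen to reproduce the slide's totals.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group intermediate values by key, as the framework does before reducing."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reducer for word count: sum the values collected for one key."""
    return (key, sum(values))


# Hypothetical mapper output gathered from several nodes.
intermediate = [
    ("irure", 1), ("in", 1), ("ea", 1), ("in", 1), ("enim", 1),
    ("eu", 1), ("dolore", 1), ("Duis", 1), ("in", 1), ("dolore", 1),
]
totals = dict(reduce_counts(key, values) for key, values in shuffle(intermediate).items())
print(totals)
```
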
Who is using it?
• Adobe - Used for data storage and processing - 30 nodes
• Facebook - Used for reporting and analytics - 320 nodes
• FOX - Used for log analysis and data mining - 140 nodes
• Last.fm - Used for chart calculation and log analysis - 27 nodes
• New York Times - Used for large scale image conversion - 100 nodes
• Yahoo! - Used for Ad systems and Web search - 10,000 nodes

Use Cases
• Video and Image processing
• Log analysis
• Spam/BOT analysis
• Behavioral analytics (CRM)
• Sequential pattern analysis (e.g. understanding long-term customer buying behavior for cross selling and target marketing)

Recommended Hardware
Commodity servers
• 1 RU
• 2 x 4 core CPU
• 4-8GB of RAM using ECC memory
• 4 x 1TB SATA drives
• 1-5TB external storage
Typically arranged in a 2-level architecture
• 30-40 nodes per rack

Challenges
• No version and dependency management.
• Configuration: more than 150 parameters.
• No security against accidents. User identification was added after Last.fm deleted a filesystem by accident.
• HDFS is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file.
• Steep learning curve. According to Facebook, using Hadoop was not easy for end users, especially for those who were not familiar with MapReduce.

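
The small-files problem can be made concrete with a rough cost model: every file read pays a fixed overhead (seek plus locating the datanode) before any data streams. The sketch below is a back-of-the-envelope estimate, not a measurement. The 10 ms per-file overhead is an assumed figure, the 50MB/s rate is borrowed from the scanning example earlier in the deck, and `read_seconds` is an illustrative name.

```python
MB = 10**6
KB = 10**3


def read_seconds(n_files, file_bytes, overhead_s=0.01, rate_bytes_per_s=50 * MB):
    """Estimated time to read n_files sequentially: fixed overhead plus streaming time."""
    return n_files * (overhead_s + file_bytes / rate_bytes_per_s)


# The same 1 GB of data, stored two different ways.
large = read_seconds(n_files=10, file_bytes=100 * MB)
small = read_seconds(n_files=100_000, file_bytes=10 * KB)

print(f"10 files of 100 MB:     {large:.1f} s")
print(f"100,000 files of 10 KB: {small:.1f} s")
```

Under these assumptions the small-file layout is roughly fifty times slower, and almost all of that time is per-file overhead rather than data transfer.
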
Questions?
Images:
http://www.flickr.com/photos/labguest/3509303134
http://www.flickr.com/photos/tantrum_dan/3546852841
