Apache Hadoop
Sheetal Sharma
Intern At IBM Innovation Centre
Why Data?
Get insights to offer a better product
“More data usually beats better algorithms”
Get insights to make better decisions
Avoid “guesstimates”
What Is Challenging?
Store data reliably
Analyze data quickly
In a cost-effective way
Using an expressive, high-level language
Fundamental Ideas
A big system of machines, not a big machine
Failures will happen
Move computation to data, not data to computation
Write complex code only once, but right
Apache Hadoop
Open-source Java software
For storing and processing very large data sets
On clusters of commodity machines
Using a simple programming model
Apache Hadoop
Two main components:
HDFS – a distributed file system
MapReduce – a distributed processing layer
HDFS
The Purpose Of HDFS
● Store large datasets in a distributed, scalable and fault-tolerant way
● High throughput
● Very large files
● Streaming reads and writes (write once, no in-place edits)
HDFS Mis-Usage
Do NOT use HDFS if you need:
Low-latency requests
Random reads and writes
Lots of small files
In those cases it is better to consider alternatives such as an RDBMS.
Splitting Files And Replicating Blocks
Split a very large file into smaller (but still large) blocks
Store them redundantly on a set of machines
Splitting Files Into Blocks
● The default block size is 64 MB (128 MB in later Hadoop versions)
● This minimizes the overhead of a disk seek operation (less than 1%)
● A file is just “sliced” into chunks after each 64 MB (or so)
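As an illustration (not part of the original slides), the block size can also be overridden per file through the HDFS Java API; a minimal sketch, with a made-up path and sizes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    long blockSize = 128L * 1024 * 1024;  // 128 MB instead of the configured default
    short replication = 3;
    int bufferSize = 4096;

    // FileSystem.create() lets a client choose the block size for this one file
    FSDataOutputStream out = fs.create(
        new Path("/tmp/big-file.txt"), true, bufferSize, replication, blockSize);
    out.writeBytes("hello hdfs\n");
    out.close();
  }
}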
Replicating Blocks
The default replication factor is 3
● It can be changed per file or per directory
● It can also be changed later, for files that already exist (see the command below)
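For example, the replication factor of an existing file can be changed from the command line with the standard setrep command (the path here is illustrative):

$ hadoop fs -setrep -w 2 /toplist/2013-05-15/poland.txt

The -w flag makes the command wait until the re-replication has actually completed.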
Master And Slaves
The Master node keeps and manages all metadata information
The Slave nodes store blocks of data and serve them to the client
Master node: called the NameNode
Slave nodes: called DataNodes
Classical* HDFS Cluster
*no NameNode HA, no HDFS Federation
NameNode – manages metadata
Secondary NameNode – does some “house-keeping” operations for the NameNode
DataNodes – store and retrieve blocks of data
HDFS NameNode
Performs all the metadata-related operations
Keeps information in RAM (for fast look-up):
The file system tree
Metadata for all files/directories (e.g. ownership, permissions)
Names and locations of blocks
HDFS DataNode
Stores and retrieves blocks of data
Data is stored as regular files on a local filesystem (e.g. ext4), e.g. blk_-992391354910561645 (+ checksums in a separate file)
A block itself does not know which file it belongs to!
Sends a heartbeat message to the NameNode to say that it is still alive
Sends a block report to the NameNode periodically
HDFS Secondary NameNode
NOT a failover NameNode
Periodically merges a prior snapshot (fsimage) and edit log(s) (edits):
Fetches the current fsimage and edits files from the NameNode
Applies the edits to the fsimage to create an up-to-date fsimage
Then sends the up-to-date fsimage back to the NameNode
Reading A File From HDFS
Block data is never sent through the NameNode
The NameNode redirects a client to an appropriate DataNode
The NameNode chooses a DataNode that is as “close” as possible
The client asks the NameNode only for the block locations; the bulk of the data then flows directly from the DataNodes to the client
$ hadoop fs -cat /toplist/2013-05-15/poland.txt
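The same read can also be done from Java through the HDFS client API; a minimal sketch (the path is the one used above, the rest is standard boilerplate):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatFromHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // metadata requests go to the NameNode
    InputStream in = fs.open(new Path("/toplist/2013-05-15/poland.txt"));
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);  // block data streams from DataNodes
    } finally {
      IOUtils.closeStream(in);
    }
  }
}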
HDFS And Local File System
● Runs on top of a native file system (e.g. ext3, ext4, xfs)
● HDFS is simply a Java application that uses a native file system to store its blocks
HDFS Data Integrity
HDFS detects corrupted blocks
● When writing:
The client computes a checksum for each block and sends the checksums to a DataNode together with the data
● When reading:
The client verifies the checksums while reading the data back
HDFS NameNode Scalability
Stats based on Yahoo! clusters:
● An average file ≈ 1.5 blocks (block size = 128 MB)
● An average file ≈ 600 bytes in RAM (1 file object and 2 block objects)
● 100M files ≈ 60 GB of metadata (100M files × ≈600 bytes)
HDFS NameNode Performance
Read/write operation throughput is limited by one machine:
● ~120K read ops/sec
● ~6K write ops/sec
MapReduce tasks are also HDFS clients
Internal load increases as the cluster grows
HDFS Main Limitations
Single NameNode:
● Keeps all metadata in RAM
● Performs all metadata operations
● Becomes a single point of failure and a performance bottleneck
MapReduce
MapReduce Model
A programming model inspired by functional programming
map() and reduce() functions processing <key, value> pairs
Useful for batch processing of large data sets
Map And Reduce Functions
Map And Reduce Functions – Counting Words
MapReduce Job
Input data is divided into splits and converted into <key, value> pairs
The framework invokes the map() function multiple times
Keys are sorted, values are not (but they could be)
The framework invokes the reduce() function multiple times
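As a small worked example (not from the slides), this is how a word-count job would move through those stages for two made-up input lines:

Input splits:   (0, "hadoop stores data")   (19, "hadoop processes data")
After map():    ("hadoop", 1) ("stores", 1) ("data", 1) ("hadoop", 1) ("processes", 1) ("data", 1)
After sorting:  ("data", [1, 1]) ("hadoop", [1, 1]) ("processes", [1]) ("stores", [1])
After reduce(): ("data", 2) ("hadoop", 2) ("processes", 1) ("stores", 1)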
MapReduce Example: ArtistCount
Input record: Artist, Song, Timestamp, User
The key is the byte offset of the line from the beginning of the file
We could specify which artist goes to which reducer (HashPartitioner is the default one)
MapReduce Example: ArtistCount
map(Integer key, EndSong value, Context context):
  context.write(value.artist, 1)

reduce(String key, Iterator<Integer> values, Context context):
  int count = 0
  for each v in values:
    count += v
  context.write(key, count)

Pseudo-code in a non-existing language ;)
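A rough Java translation of that pseudo-code is sketched below. It assumes the input is plain text in the "Artist, Song, Timestamp, User" format from the previous slide (there is no real EndSong type here), so the mapper simply splits each line on commas:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ArtistCount {

  // key = byte offset of the line, value = "Artist,Song,Timestamp,User"
  public static class ArtistMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text artist = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      artist.set(fields[0].trim());
      context.write(artist, ONE);
    }
  }

  // sums up all the 1s emitted for a given artist
  public static class ArtistReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int count = 0;
      for (IntWritable v : values) {
        count += v.get();
      }
      context.write(key, new IntWritable(count));
    }
  }
}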
MapReduce Combiner
Make sure that the Combiner combines quickly and aggressively enough (otherwise it only adds overhead)
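Because the ArtistCount reduce function only sums integers (associative and commutative), the reducer class itself can double as the combiner. A sketch of the job setup, using the class names from the example above and the newer mapreduce API (paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ArtistCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "artist-count");
    job.setJarByClass(ArtistCountDriver.class);
    job.setMapperClass(ArtistCount.ArtistMapper.class);
    job.setCombinerClass(ArtistCount.ArtistReducer.class);  // pre-aggregates on the map side
    job.setReducerClass(ArtistCount.ArtistReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/endsong/2013-05-15"));
    FileOutputFormat.setOutputPath(job, new Path("/artistcount/2013-05-15"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}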
MapReduce Implementation
● A batch processing system
● Automatic parallelization and distribution of computation
● Fault tolerance
● Deals with all the messy details of distributed processing
● Relatively easy to use for programmers
JobTracker Responsibilities
● Manages the computational resources: the available TaskTrackers and their map and reduce slots
● Schedules all user jobs
● Schedules all tasks of a job on the TaskTrackers
TaskTracker Responsibilities
● Runs map and reduce tasks
● Reports to the JobTracker:
heartbeats saying that it is still alive
the number of free map and reduce slots
task progress and status
Apache Hadoop Cluster
● It can consist of 1, 5, 100, or 4,000 nodes
Thank You!