Hadoop: Understanding HDFS, MR & YARN
Hadoop
• Hadoop is an open-source software framework for distributed storage and distributed processing of large structured, semi-structured, and unstructured data sets across clusters of commodity servers.
Hadoop timeline, 2002 to 2009 (diagram; events listed in approximate chronological order)
• Nutch: Doug Cutting & Mike Cafarella
• Google GFS
• Google MR
• NDFS & MR added to Nutch
• Doug Cutting joins Yahoo!
• Hadoop spins out of Nutch as a subproject of Lucene
• NY Times converts 4 TB of image archives across 100 EC2 instances
• Hadoop becomes an Apache top-level project
• Yahoo!: fastest sort of a TB, 3.5 mins on 910 nodes
• Facebook launches Hive
• Cloudera founded
• Doug Cutting joins Cloudera
• Fastest sort of a TB: 62 secs over 1460 nodes
• Petabyte sort: 16.25 hrs on 3658 nodes
• Yahoo!'s 4000-node cluster, followed by Facebook's 2300-node cluster, are the largest clusters
Hadoop core components
• HDFS: Hadoop Distributed File System
• MapReduce: programming model, distributed processing engine
• YARN (MRv2): Yet Another Resource Negotiator; resource management / central operating platform
Design of HDFS
• Designed for
  • Very large files
  • Streaming data access
  • Commodity hardware
• Not meant for
  • Low-latency data access
  • Lots of small files
  • Multiple writers, arbitrary file modifications
Hadoop Storage: HDFS Architecture
• Components: DataNodes, block replication, NameNode [FsImage, edits log]
• Block replication (data replication) determines how redundantly data is stored in HDFS
  • The replication factor determines the number of copies
  • Factor 2: the second copy is stored on a different rack
  • Factor 3: one copy is stored on a different rack and one on the same rack
• DataNodes store the actual data
  • Data is stored as blocks
  • The block size can be tuned
  • The default is 64 MB (Hadoop 1.x) or 128 MB (Hadoop 2.x)
  • The smaller the block size (and so the more blocks), the more the NameNode has to manage
• The NameNode manages block locations
  • It stores the "metadata"
  • The NameNode is a single point of failure unless HA is configured
  • RAM is important here, because the NameNode holds the metadata in memory
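HDFS Architecture (diagram)
• Clients read from and write to the DataNodes
• DataNodes on Rack 1 and Rack 2 hold the block replicas (b1, b2, b3, b5) and replicate blocks between each other
• Block ops [heartbeat, block info] flow between the DataNodes and the active NameNode
• The active NameNode holds the metadata: file/directory name, permissions, ownerships, assigned blocks
• Example metadata entry: /user/foo/data,3,rw-rw-r--,dev:hdfs,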
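The block-size trade-off in the HDFS architecture notes above (smaller blocks mean more blocks for the NameNode to track) can be made concrete with a little arithmetic. This is only an illustrative sketch: the 1 TiB file size and the helper name `block_count` are assumptions, and it counts blocks and replicas without modelling the NameNode's real memory use.

```python
MB = 1024 * 1024


def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of blocks one file occupies (the last block may be partial)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


file_size = 1024 * 1024 * MB  # hypothetical 1 TiB file
replication = 3               # replication factor from the notes above

for block_mb in (64, 128):
    blocks = block_count(file_size, block_mb * MB)
    # Every block is stored `replication` times across the DataNodes.
    print(f"{block_mb} MB blocks: {blocks} blocks, {blocks * replication} replicas")
```

Halving the block size doubles the number of block entries the NameNode has to keep track of for the same data.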
Hadoop 2.x Cluster Architecture
• ResourceManager
  • Master that arbitrates all the available cluster resources
• ApplicationMaster
  • Negotiates resources with the ResourceManager and works with the NodeManagers to start the containers
  • Is the middleman between the NodeManagers (NM) and the ResourceManager (RM)
  • Allows for greater scalability
• NodeManager
  • Takes instructions from the ResourceManager and manages the resources available on a single node
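The division of labour between the three roles above can be sketched as a toy model. Everything here is hypothetical (class names, memory sizes, the "most free memory" rule); a real YARN scheduler is far more involved, so read this as an illustration of who talks to whom, not as YARN's API.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Manages the resources of a single node."""
    name: str
    free_mb: int
    running: list = field(default_factory=list)

    def launch(self, task: str, mb: int) -> None:
        self.free_mb -= mb
        self.running.append(task)


@dataclass
class ResourceManager:
    """Arbitrates the resources of the whole cluster."""
    nodes: list

    def allocate(self, mb: int):
        # Toy policy: grant a container on the node with the most free memory.
        best = max(self.nodes, key=lambda n: n.free_mb)
        return best if best.free_mb >= mb else None


class ApplicationMaster:
    """Negotiates with the RM, then asks NMs to start containers."""

    def __init__(self, rm: ResourceManager):
        self.rm = rm

    def run(self, tasks, mb_per_task):
        placed, pending = {}, []
        for task in tasks:
            node = self.rm.allocate(mb_per_task)
            if node is None:
                pending.append(task)  # no capacity left for this task
            else:
                node.launch(task, mb_per_task)
                placed[task] = node.name
        return placed, pending


rm = ResourceManager([NodeManager("nm1", 2048), NodeManager("nm2", 1024)])
placed, pending = ApplicationMaster(rm).run(
    ["map-0", "map-1", "map-2", "map-3"], 1024
)
print(placed, pending)
```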
Federation
• Allows for multiple namespaces
• Separates namespace and storage
  • Namespace: manages directories, files and blocks. It supports file system operations such as creation, modification, deletion and listing of files and directories.
  • Block storage: supports block-related operations such as creation, deletion, modification and getting the location of blocks. It also takes care of replica placement and replication, stores the blocks, and provides read/write access to them.
• Improves scalability and isolation
• Without federation, the namespace does not scale as easily
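HDFS Federation (diagram)
• Hadoop 1.0: a single NameNode holds the namespace (NS1) and block management; block storage is Datanode 1 to Datanode n
• Hadoop 2.0: NN 1, NN 2, up to NN n each hold their own namespace (NS1, NS2, up to NS n) and block pool (Pool1, Pool2, up to Pooln); Datanode 1, Datanode 2, up to Datanode n provide the common block storage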
HDFS FED Example (diagram, Hadoop 2.0)
• NN 1 (NS1): /user/data/etl
• NN 2 (NS2): /user/data/xml
• NN n (NS n): /home/streaming/data/weather
• All NameNodes use Datanode 1, Datanode 2, up to Datanode n
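One way to picture the federation example is as a mount table that maps a path to the namespace owning it. The sketch below is an assumption made for illustration, loosely in the spirit of a client-side mount table; the names `MOUNT_TABLE` and `resolve_namespace` are hypothetical and this is not Hadoop's API.

```python
# Mount points taken from the example slide; the namespace labels are the slide's.
MOUNT_TABLE = {
    "/user/data/etl": "NS1",
    "/user/data/xml": "NS2",
    "/home/streaming/data/weather": "NSn",
}


def resolve_namespace(path: str, mounts: dict = MOUNT_TABLE):
    """Return the namespace whose mount point is the longest prefix of `path`."""
    best = None
    for mount in mounts:
        # Match whole path components only, so /user/data/xmlfiles is not under /user/data/xml.
        if path == mount or path.startswith(mount + "/"):
            if best is None or len(mount) > len(best):
                best = mount
    return mounts[best] if best is not None else None


print(resolve_namespace("/user/data/etl/2016/part-0"))
print(resolve_namespace("/tmp/scratch"))
```

Each namespace can then be served by its own NameNode, which is what lets the namespace scale out.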
HA
• Prior to Hadoop 2.0:
  • One NameNode for metadata management
  • Single point of failure
• HDFS High Availability:
  • Two NameNodes in the same cluster
    • Active NameNode: responsible for all client operations
    • Standby NameNode: maintains enough state to provide a fast failover
  • Shared storage
    • The active NN writes the edit log
    • The standby NN reads the edit log and applies it to its own namespace
  • During failover, the standby NN reads all the remaining edits and transitions to the active state
High Availability
http://www.slideshare.net/cloudera/ha-phase-2-with-atm-updates
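The shared edit log described in the HA notes above can be sketched as a small simulation. The class and function names are hypothetical, the "namespace" is just a dictionary, and fencing and failure detection are left out entirely; the point is only to show how the standby stays close to the active NameNode's state.

```python
class NameNode:
    def __init__(self, name):
        self.name = name
        self.namespace = {}  # path -> replication factor
        self.applied = 0     # how many shared edits this NN has applied

    def apply(self, edit):
        op, path, value = edit
        if op == "create":
            self.namespace[path] = value
        elif op == "delete":
            self.namespace.pop(path, None)
        self.applied += 1


shared_edit_log = []  # stands in for the shared storage


def write(active, edit):
    """The active NN records the edit in shared storage and applies it."""
    shared_edit_log.append(edit)
    active.apply(edit)


def catch_up(standby):
    """The standby NN reads the edits it has not seen yet and applies them."""
    for edit in shared_edit_log[standby.applied:]:
        standby.apply(edit)


def failover(standby):
    """Read all remaining edits, then take over as the active NN."""
    catch_up(standby)
    return standby


nn1, nn2 = NameNode("nn1"), NameNode("nn2")
write(nn1, ("create", "/a", 3))
write(nn1, ("create", "/b", 2))
catch_up(nn2)                       # standby tails the log periodically
write(nn1, ("delete", "/a", None))  # last edit before nn1 fails
new_active = failover(nn2)
print(new_active.name, new_active.namespace)
```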
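Anatomy of File Write (diagram)
Components: client node (client JVM running the HDFS client, DistributedFileSystem, FSDataOutputStream), NameNode, and a pipeline of DataNodes across Rack 1 and Rack 2 holding blocks b1, b2, b3.
1. create (HDFS client to DistributedFileSystem)
2. create (DistributedFileSystem to NameNode)
3. write (HDFS client to FSDataOutputStream)
4. Write packet (FSDataOutputStream to the pipeline of DataNodes, forwarded from DataNode to DataNode)
5. Ack packet (back along the pipeline)
6. close (HDFS client to FSDataOutputStream)
7. Complete (to the NameNode)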
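Steps 4 and 5 of the write path (write packet, ack packet) can be sketched as follows. This is a failure-free toy model under stated assumptions: the `DataNode` class, `write_block`, and the 4-byte packet size are all hypothetical, and real packets are far larger.

```python
from collections import deque


class DataNode:
    def __init__(self, name):
        self.name = name
        self.packets = []

    def receive(self, packet, downstream):
        """Store the packet, forward it down the pipeline, and pass acks back."""
        self.packets.append(packet)
        acks = downstream[0].receive(packet, downstream[1:]) if downstream else []
        return acks + [self.name]  # the last node's ack arrives first


def write_block(data: bytes, pipeline, packet_size: int = 4) -> int:
    """Stream `data` through the pipeline; return the number of unacknowledged packets."""
    data_queue = deque(
        data[i:i + packet_size] for i in range(0, len(data), packet_size)
    )
    ack_queue = deque()
    while data_queue:
        packet = data_queue.popleft()
        ack_queue.append(packet)  # kept until every DataNode has acknowledged it
        acks = pipeline[0].receive(packet, pipeline[1:])
        if len(acks) == len(pipeline):
            ack_queue.popleft()
    return len(ack_queue)


dns = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
unacked = write_block(b"hello hdfs!", dns)
print(unacked, [len(dn.packets) for dn in dns])
```

The client only talks to the first DataNode; each node forwards the packet to the next, which is why replication does not multiply the client's outbound traffic.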
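Anatomy of File Read (diagram)
Components: client node (client JVM running the HDFS client, DistributedFileSystem, FSDataInputStream), NameNode, and DataNodes across Rack 1 and Rack 2 holding blocks b1, b2, b3.
1. open (HDFS client to DistributedFileSystem)
2. Get block locations (DistributedFileSystem to NameNode)
3. read (HDFS client to FSDataInputStream)
4. read (FSDataInputStream from a DataNode)
5. read (FSDataInputStream from the next DataNode)
6. close (HDFS client to FSDataInputStream)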
Map Reduce 1 (diagram)
Components: client node (client JVM running the MR job), job tracker node (JobTracker), tasktracker node (TaskTracker, child JVM running the map or reduce task), HDFS.
1. Run job
2. Get new job ID
3. Copy job resources
4. Submit job
5. Init job
6. Retrieve input splits
7. Heartbeat
8. Retrieve job resources
9. Launch
10. Run
YARN (Map Reduce 2) (diagram)
Components: client node (client JVM running the MR job), ResourceManager, one NodeManager node running the MR App Master, one NodeManager node running the task JVM (YarnChild, map or reduce task), HDFS. The slide still labels the NodeManager nodes "tasktracker Node".
1. Run job
2. Get new application ID
3. Copy job resources
4. Submit application
5a. Start container
5b. Launch (MR App Master)
6. Init job
7. Retrieve input splits
8. Allocate resources
9a. Start container
9b. Launch (task JVM)
10. Retrieve job resources
11. Run
Coherency Model
• Once more than a block's worth of data has been written, the first block becomes visible to readers
• The block currently being written is the one that is not always visible to readers
map reduce (word count diagram)
• input: cats, dogs, cows, cats, dogs, dogs, cows, cats
• split: [cats, dogs, cows] [cows, cats] [cats, dogs, dogs]
• map: [cats,1 dogs,1 cows,1] [cows,1 cats,1] [cats,1 dogs,1 dogs,1]
• shuffle: [cats,1 cats,1 cats,1] [cows,1 cows,1] [dogs,1 dogs,1 dogs,1]
• reduce: [cats,3] [cows,2] [dogs,3]
• output: cats,3 cows,2 dogs,3
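The same word count can be traced in a few lines of plain Python. This is a single-process sketch of the split, map, shuffle and reduce stages from the diagram, not Hadoop code; the function names are made up for the illustration.

```python
from collections import defaultdict

# The three input splits from the diagram.
splits = [
    ["cats", "dogs", "cows"],
    ["cows", "cats"],
    ["cats", "dogs", "dogs"],
]


def map_phase(split):
    """Emit (word, 1) for every word in one split."""
    return [(word, 1) for word in split]


def shuffle(mapped):
    """Group the values of all map outputs by key, ordered by key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Sum the values that share a key."""
    return {key: sum(values) for key, values in groups.items()}


mapped = [map_phase(split) for split in splits]
result = reduce_phase(shuffle(mapped))
print(result)  # {'cats': 3, 'cows': 2, 'dogs': 3}
```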
MapReduce shuffle (diagram)
Map side (Map 1 on a DataNode, up to DataNode n):
• The map reads an InputSplit from HDFS
• Map takes <k,v>, applies map() to each <k,v>, and writes the output to a memory buffer (100 MB)
• The buffer spills data to disk, partitions the data (p1, p2, p3, p4), and sorts it by key
• Combine()
• Intermediate map output files are sorted by key (Part-m-00000)
Shuffle:
• Each partition (p1, p2) is copied to the DataNode running its reduce
• Merge: 1 file/partition
Reduce side:
• Sort/merge, then Reduce
• Output is written to HDFS
Map output is sorted by key.
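The spill-and-merge behaviour in the diagram can be sketched with sorted runs and a k-way merge. This is a simplified illustration under stated assumptions: a record-count limit stands in for the 100 MB buffer, partitioning and the combiner are left out, and the function names are hypothetical.

```python
import heapq


def spill_runs(records, buffer_limit):
    """Fill a buffer; whenever it is full, sort it by key and spill it as one run."""
    runs, buffer = [], []
    for record in records:
        buffer.append(record)
        if len(buffer) == buffer_limit:
            runs.append(sorted(buffer))
            buffer = []
    if buffer:
        runs.append(sorted(buffer))  # final partial spill
    return runs


def merge_runs(runs):
    """Merge the sorted spill runs into one sorted output."""
    return list(heapq.merge(*runs))


map_output = [("dogs", 1), ("cats", 1), ("cows", 1), ("cats", 1), ("dogs", 1)]
runs = spill_runs(map_output, buffer_limit=2)
merged = merge_runs(runs)
print(runs)
print(merged)
```

Because every spill is already sorted, the merge never needs to hold all records in memory at once.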
MR gotchas
• Map takes each input split as key-value pairs
• Output from the mapper is always sorted, but only by key
  • After context.write(outKey, outValue); the result is sorted by outKey
• The default partitioner hashes the key
• The reducer reduces a set of intermediate values that share a key to a smaller set of values
  • The reduce() function is called once for each key
• job.setNumReduceTasks(0) makes the job map-only: no reducers run, and the map output is written straight to the output path without sorting
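Hash partitioning, the default mentioned above, amounts to "hash the key, clear the sign bit, take the remainder by the number of reducers". The sketch below mimics that in Python. One assumption to note: it uses a Java `String.hashCode()`-style hash as a stand-in for the key's hash (valid for ASCII keys only), so the partition numbers will not necessarily match what Hadoop assigns to its own key types.

```python
def java_string_hash(s: str) -> int:
    """Mimic Java's String.hashCode() as a signed 32-bit integer (ASCII keys)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def partition(key: str, num_reducers: int) -> int:
    """Hash partitioning: clear the sign bit, then take the remainder."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


for key in ("cats", "dogs", "cows"):
    print(key, "-> reducer", partition(key, 2))
```

The property that matters is that equal keys always land on the same reducer, which is what lets reduce() see every value for a key.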
Hadoop Distributions
• Cloudera
• Hortonworks
• MapR
• Pivotal
• Microsoft Azure HDInsight
• IBM BigInsights
• https://www.cloudera.com/content/dam/www/static/documents/analyst-reports/forrester-wave-big-data-hadoop-distributions.pdf
What’s next
• Hadoop IO
  • Serialization
    • Avro, Sequence, Map Files
  • File Formats
    • Text, Binary, XML, DB
• Joins
  • Map-side, Reduce-side
• Secondary Sorting
• Side Data Distribution
  • Distributed Cache
  • Using jobconfig
References
• Hadoop: The Definitive Guide
• https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
• http://stackoverflow.com/questions/24771006/is-the-output-of-map-phase-of-the-mapreduce-job-always-sorted
• http://www.slideshare.net/AdamKawa/apache-hadoop-yarn-namenode-ha-hdfs-federation
• http://www.datanami.com/2016/05/11/open-source-tour-de-force-apache-big-data-2016/
• Questions
• Comments
