Hadoop and
MapReduce
                               Friso van Vollenhoven
                               fvanvollenhoven@xebia.com


The workings of the elephant
Data everywhere

‣ Global data volume grows exponentially
‣ Information retrieval is BIG business these days
‣ Need means of economically storing and processing large data sets
Opportunity

‣ Commodity hardware is ultra cheap
‣ CPU and storage even cheaper
Traditional solution

‣ Store data in a (relational) database
‣ Run batch jobs for processing
Problems with existing solutions

‣ Databases are seek heavy; B-tree gives log(n) random accesses per update
‣ Seeks are wasted time, nothing of value happens during seeks
‣ Databases do not play well with commoditized hardware (SANs and 16 CPU
    machines are not in the price sweet spot of performance / $)
‣   Databases were not built with horizontal scaling in mind
Solution: sort/merge vs. updating the B-tree

‣   Eliminate the seeks, only sequential reading / writing
‣   Work with batches for efficiency
‣   Parallelize work load
‣   Distribute processing and storage
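
The sort/merge idea above can be sketched as follows: instead of applying each update through a random-access index lookup, a batch of updates is sorted once and then merged with the already sorted data in a single sequential pass. This is only an illustration; the class and method names are hypothetical, and in-memory lists stand in for sorted files on disk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration: in-memory lists stand in for sorted files on disk.
final class SortMerge {
    // Merges two ascending-sorted lists in one sequential pass over each input.
    static List<String> merge(List<String> existing, List<String> sortedBatch) {
        List<String> merged = new ArrayList<>(existing.size() + sortedBatch.size());
        int i = 0;
        int j = 0;
        while (i < existing.size() && j < sortedBatch.size()) {
            // Always take the smaller head; neither input is ever re-read.
            if (existing.get(i).compareTo(sortedBatch.get(j)) <= 0) {
                merged.add(existing.get(i++));
            } else {
                merged.add(sortedBatch.get(j++));
            }
        }
        while (i < existing.size()) merged.add(existing.get(i++));
        while (j < sortedBatch.size()) merged.add(sortedBatch.get(j++));
        return merged;
    }

    public static void main(String[] args) {
        List<String> existing = List.of("apple", "cherry", "plum");
        List<String> batch = new ArrayList<>(List.of("melon", "banana"));
        Collections.sort(batch); // sort the batch once, then merge sequentially
        System.out.println(merge(existing, batch));
        // prints [apple, banana, cherry, melon, plum]
    }
}
```

Both inputs are read front to back exactly once, which is the access pattern that disks handle well.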
History

‣ 2000: Apache Lucene: batch index updates and sort/merge with on disk index
‣ 2002: Apache Nutch: distributed, scalable open source web crawler; sort/merge
    optimization applies
‣   2004: Google publishes GFS and MapReduce papers
‣   2006: Apache Hadoop: open source Java implementation of GFS and MR to solve
    Nutch’s problem; later becomes standalone project
‣   2011: We’re here learning about it!
Hadoop foundations

‣   Commodity hardware (3K - 7K $ machines)
‣   Only sequential reads / writes
‣   Distribution of data and processing across cluster
‣   Built in reliability / fault tolerance / redundancy
‣   Disk based, does not require data or indexes to fit in RAM
‣   Apache licensed, Open Source Software
The US government builds its fingerprint search index using Hadoop.
The content for the People You May Know feature is
created by a chain of many MapReduce jobs that
run daily. The jobs are reportedly a combination of
graph traversal, clustering and assisted machine
learning.
Amazon’s Frequently Bought Together and Customers Who Bought This Item Also
Bought features are brought to you by MapReduce jobs. Recommendation
based on large sales transaction datasets is a common use case.
Top Charts
generated daily
based on millions
of users’ listening
behavior.
Top searches used for auto-completion are re-generated daily by a
MapReduce job using all searches for the past couple of days.
Popularity for search terms can be based on counts, but also trending
and correlation with other datasets (e.g. trending on social media,
news, charts in case of music and movies, best seller lists, etc.)
What is Hadoop
Hadoop
Filesystem


HDFS
HDFS overview

‣   Distributed filesystem
‣   Consists of a single master node and multiple (many) data nodes
‣   Files are split up into blocks (typically 64 MB)
‣   Blocks are spread across data nodes in the cluster
‣   Each block is replicated multiple times to different data nodes in the cluster
    (typically 3 times)
‣   Master node keeps track of which blocks belong to a file
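
As a rough worked example of the block and replication figures above, the arithmetic can be sketched like this. The constants are the "typical" values from the bullets (64 MB blocks, 3 replicas); the class and method names are made up for illustration and are not part of any Hadoop API.

```java
// Sketch using the "typical" figures from the slide; real clusters configure both values.
final class BlockMath {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB block size
    static final int REPLICATION = 3;                  // copies kept of each block

    // Blocks needed for a file; the last block may be only partly filled.
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Block replicas the cluster stores for the file.
    static long replicaCount(long fileSizeBytes) {
        return blockCount(fileSizeBytes) * REPLICATION;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGiB));             // prints 16
        System.out.println(replicaCount(oneGiB));           // prints 48
        System.out.println(blockCount(100L * 1024 * 1024)); // prints 2
    }
}
```

So a 1 GiB file becomes 16 blocks, and the cluster holds 48 block replicas for it, spread over different data nodes.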
HDFS interaction

‣   Accessible through Java API
‣   FUSE (filesystem in user space) driver available to mount as regular FS
‣   C API available
‣   Basic command line tools in Hadoop distribution
‣   Web interface
HDFS interaction

‣ File creation, directory listing and other meta data actions go through the master
    node (e.g. ls, du, fsck, create file)
‣   Data goes directly to and from data nodes (read, write, append)
‣   Local read path optimization: clients located on same machine as data node will
    always access local replica when possible
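Hadoop FileSystem (HDFS)

[Diagram: an HDFS client sends "create file" to the Name Node, which holds the
namespace (/some/file, /foo/bar). The client writes data to and reads data from
the Data Nodes directly. Each Data Node stores blocks on several local disks,
and blocks are replicated between Data Nodes. A node-local HDFS client reads
data from the Data Node on its own machine.]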
HDFS daemons: NameNode

‣   Filesystem master node
‣   Keeps track of directories, files and block locations
‣   Assigns blocks to data nodes
‣   Keeps track of live nodes (through heartbeats)
‣   Initiates re-replication in case of data node loss

‣ Block meta data is held in memory
  • Will run out of memory when too many files exist
‣ Is a SINGLE POINT OF FAILURE in the system
  • Some solutions exist
HDFS daemons: DataNode

‣ Filesystem worker node / “Block server”
‣ Uses underlying regular FS for storage (e.g. ext3)
  • Takes care of distribution of blocks across disks
  • Don’t use RAID
  • More disks means more IO throughput
‣ Sends heartbeats to NameNode
‣ Reports blocks to NameNode (on startup)
‣ Does not know about the rest of the cluster (shared nothing)
Things to know about HDFS

‣ HDFS is write once, read many
  • But has append support in newer versions
‣ Has built in compression at the block level
‣ Does end-to-end checksumming on all data
‣ Has tools for parallelized copying of large amounts of data to other HDFS
    clusters (distcp)
‣   Provides a convenient file format to gather lots of small files into a single large
    one
    • Remember the NameNode running out of memory with too many files?
‣ HDFS is best used for large, unstructured volumes of raw data in BIG files used
    for batch operations
    • Optimized for sequential reads, not random access
Hadoop Sequence Files

‣   Special type of file to store Key-Value pairs
‣   Stores keys and values as byte arrays
‣   Uses length encoded bytes as format
‣   Often used as input or output format for MapReduce jobs
‣   Has built in compression on values
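
To make "length encoded bytes" concrete, here is a minimal sketch of a key/value record layout in which every key and value is written as a 4-byte length followed by its bytes. This is a simplified illustration only, not the real SequenceFile on-disk layout, which has more to it than this; all names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified illustration only; this is NOT the real SequenceFile layout.
final class LengthPrefixedRecords {
    // Encodes each {key, value} pair as: key length, key bytes, value length, value bytes.
    static byte[] encode(List<byte[][]> records) {
        int size = 0;
        for (byte[][] r : records) size += 4 + r[0].length + 4 + r[1].length;
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (byte[][] r : records) {
            buf.putInt(r[0].length).put(r[0]);
            buf.putInt(r[1].length).put(r[1]);
        }
        return buf.array();
    }

    // Reads the records back strictly front to back; no index or seeking is needed.
    static List<byte[][]> decode(byte[] data) {
        List<byte[][]> records = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(data);
        while (buf.hasRemaining()) {
            byte[] key = new byte[buf.getInt()];
            buf.get(key);
            byte[] value = new byte[buf.getInt()];
            buf.get(value);
            records.add(new byte[][] {key, value});
        }
        return records;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] file = encode(List.of(
            new byte[][] {utf8("user1"), utf8("hello")},
            new byte[][] {utf8("user2"), utf8("hi")}));
        for (byte[][] r : decode(file)) {
            System.out.println(new String(r[0], StandardCharsets.UTF_8)
                + " -> " + new String(r[1], StandardCharsets.UTF_8));
        }
    }
}
```

Because each field carries its own length, keys and values can be arbitrary byte arrays and the file can still be read sequentially.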
Example: command directory listing



friso@fvv:~/java$ hadoop fs -ls /
Found 3 items
drwxr-xr-x    - friso supergroup    0 2011-03-31 17:06 /Users
drwxr-xr-x    - friso supergroup    0 2011-03-16 14:16 /hbase
drwxr-xr-x    - friso supergroup    0 2011-04-18 11:33 /user
friso@fvv:~/java$
Example: NameNode web interface
Example: copy local file to HDFS




friso@fvv:~/Downloads$ hadoop fs -put ./some-tweets.json tweets-data.json
MapReduce


Massively parallelizable
computing
MapReduce, the algorithm
[Figure: input data and the required output]
Map: extract something useful from each record

[Figure: map tasks turn input records into KEYS / VALUES pairs]

void map(recordNumber, record) {
  key = record.findColorfulShape();
  value = record.findGrayShapes();
  emit(key, value);
}
Framework sorts all KeyValue pairs by Key

[Figure: KEYS / VALUES pairs before and after sorting by key]
Reduce: process values for each key

[Figure: reduce tasks turn sorted KEYS / VALUES pairs into output KEYS / VALUES pairs]

void reduce(key, values) {
  allGrayShapes = [];
  foreach (value in values) {
    allGrayShapes.push(value);
  }
  emit(key, allGrayShapes);
}
MapReduce, the algorithm

[Figure: end-to-end flow. Map tasks emit KEYS / VALUES pairs, the framework
sorts them by key, and reduce tasks produce the output KEYS / VALUES pairs.]
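
The three phases (map, sort by key, reduce) can be tried out in a single process with a sketch like the one below. It uses word counting rather than the shapes from the slides, and it does not use the Hadoop API; the class name, method signatures and sample records are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process simulation of the three phases; it does not use the Hadoop API.
final class MiniMapReduce {
    // Map: emit (word, 1) for every word in one input record.
    static void map(long recordNumber, String record, List<Map.Entry<String, Integer>> out) {
        for (String word : record.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
    }

    // Reduce: combine all values that share one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    static Map<String, Integer> run(List<String> records) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) map(i, records.get(i), emitted);

        // "Framework" step: sort by key and group the values per key.
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : emitted) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }

        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((key, values) -> result.put(key, reduce(key, values)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("the quick fox", "the lazy dog", "the fox")));
        // prints {dog=1, fox=2, lazy=1, quick=1, the=3}
    }
}
```

Each call to map sees one record and each call to reduce sees one key, so neither needs to know about the rest of the data. That independence is what allows the work to be spread over many machines.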
Hadoop MapReduce: parallelized on top of HDFS

‣ Job input comes from files on HDFS
  • Typically sequence files
  • Other formats are possible; requires specialized InputFormat implementation
  • Built in support for text files (convenient for logs, csv, etc.)
  • Files must be splittable for parallelization to work
    - Not all compression formats have this property (e.g. gzip)
MapReduce daemons: JobTracker

‣   MapReduce master node
‣   Takes care of scheduling and job submission
‣   Splits jobs into tasks (Mappers and Reducers)
‣   Assigns tasks to worker nodes
‣   Reassigns tasks in case of failure
‣   Keeps track of job progress
‣   Keeps track of worker nodes through heartbeats
MapReduce daemons: TaskTracker

‣   MapReduce worker process
‣   Starts Mappers and Reducers assigned by JobTracker
‣   Sends heartbeats to the JobTracker
‣   Sends task progress to the JobTracker
‣   Does not know about the rest of the cluster (shared nothing)
Hadoop MapReduce: Mapper side

‣ Each mapper processes a piece of the total input
  • Typically blocks that reside on the same machine as the mapper (local
      datanode)
‣   Mappers sort output by key and store it on the local disk
    • If the mapper output does not fit in RAM, an on-disk merge sort is performed
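
The on-disk merge sort mentioned above can be sketched as an external sort: fill a bounded buffer, sort it, spill it as a sorted run, and finally merge all runs. In this sketch lists stand in for spill files on local disk, and the buffer size is counted in records; both are simplifying assumptions, and the names are hypothetical rather than Hadoop's own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: lists stand in for spill files on the mapper's local disk.
final class SpillSort {
    // Buffers at most bufferSize keys at a time and spills each full buffer as a sorted run.
    static List<List<String>> spillRuns(List<String> keys, int bufferSize) {
        List<List<String>> runs = new ArrayList<>();
        List<String> buffer = new ArrayList<>(bufferSize);
        for (String key : keys) {
            buffer.add(key);
            if (buffer.size() == bufferSize) {
                Collections.sort(buffer);
                runs.add(buffer);
                buffer = new ArrayList<>(bufferSize);
            }
        }
        if (!buffer.isEmpty()) {
            Collections.sort(buffer);
            runs.add(buffer);
        }
        return runs;
    }

    // K-way merge: the queue holds one cursor {runIndex, position} per run.
    static List<String> mergeRuns(List<List<String>> runs) {
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1])));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) heads.add(new int[] {r, 0});
        }
        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            merged.add(runs.get(head[0]).get(head[1]));
            if (head[1] + 1 < runs.get(head[0]).size()) {
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("pear", "apple", "fig", "kiwi", "date", "lime", "plum");
        List<List<String>> runs = spillRuns(keys, 3);
        System.out.println(runs);
        // prints [[apple, fig, pear], [date, kiwi, lime], [plum]]
        System.out.println(mergeRuns(runs));
        // prints [apple, date, fig, kiwi, lime, pear, plum]
    }
}
```

Only one buffer of keys is held in memory at a time while sorting, and the merge step reads every run sequentially.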
Hadoop MapReduce: Reducer side

‣ Reducers collect sorted input KeyValue pairs over the network from Mappers
  • Reducer performs (on disk) merge on inputs from different mappers
‣ Reducer calls the reduce method for each unique key
  • List of values for each key is read from local disk (the result of the merge)
  • Values do not need to fit in RAM
    - Reduce methods that need a global view need enough RAM to fit all values
       for a key

‣ Reducer writes output KeyValue pairs to HDFS
  • Typically blocks go to local data node
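
Why the values for a key do not need to fit in RAM can be seen in a small sketch: because the merged input is sorted by key, a reducer can walk it once, keep only a running aggregate, and emit a result whenever the key changes. The names below are hypothetical, and an in-memory iterator stands in for the merged on-disk stream.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: an iterator stands in for the merged, key-sorted input on disk.
final class StreamingReduce {
    // Walks a key-sorted stream once and emits "key=sum" per unique key.
    static List<String> sumPerKey(Iterator<Map.Entry<String, Integer>> sorted) {
        List<String> output = new ArrayList<>();
        String currentKey = null;
        long sum = 0;
        while (sorted.hasNext()) {
            Map.Entry<String, Integer> kv = sorted.next();
            if (currentKey != null && !kv.getKey().equals(currentKey)) {
                output.add(currentKey + "=" + sum); // key boundary: emit and reset
                sum = 0;
            }
            currentKey = kv.getKey();
            sum += kv.getValue();
        }
        if (currentKey != null) output.add(currentKey + "=" + sum);
        return output;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> merged = List.of(
            Map.entry("fox", 1), Map.entry("fox", 1),
            Map.entry("the", 1), Map.entry("the", 1), Map.entry("the", 1));
        System.out.println(sumPerKey(merged.iterator()));
        // prints [fox=2, the=3]
    }
}
```

A sum needs only one counter per key. A reduce method that needs a global view of the values, such as a median, would have to hold all values for the key at once, which is the case the bullet above warns about.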
<PLUG>
                           Summer Classes
   Big data crunching using Hadoop and other NoSQL tools
   •   Write Hadoop MapReduce jobs in Java
   •   Run on an actual cluster pre-loaded with several datasets
   •   Create a simple application or visualization with the result
   •   Learn about Hadoop without the hassle of building a production cluster first
   •   Have lots of fun!

                 Dates: July 12, August 10
             Only € 295,= for a full day course
                   http://www.xebia.com/summerclasses/bigdata

                                                                            </PLUG>

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 

Hadoop, HDFS and MapReduce

  • 1. Hadoop and MapReduce Friso van Vollenhoven fvanvollenhoven@xebia.com The workings of the elephant
  • 2. Data everywhere ‣ Global data volume grows exponentially ‣ Information retrieval is BIG business these days ‣ Need means of economically storing and processing large data sets
  • 3. Opportunity ‣ Commodity hardware is ultra cheap ‣ CPU and storage even cheaper
  • 4. Traditional solution ‣ Store data in a (relational) database ‣ Run batch jobs for processing
  • 5. Problems with existing solutions ‣ Databases are seek heavy; B-tree gives log(n) random accesses per update ‣ Seeks are wasted time, nothing of value happens during seeks ‣ Databases do not play well with commoditized hardware (SANs and 16 CPU machines are not in the price sweet spot of performance / $) ‣ Databases were not built with horizontal scaling in mind
  • 6. Solution: sort/merge vs. updating the B-tree ‣ Eliminate the seeks, only sequential reading / writing ‣ Work with batches for efficiency ‣ Parallelize work load ‣ Distribute processing and storage
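To make the seek argument concrete, here is a rough back-of-the-envelope sketch. The seek time, transfer rate, data set size and record size are all illustrative assumptions (none of them come from the slides), and one seek per update is a charitable lower bound for a B-tree.

```java
class SeekVsScan {
    // Illustrative assumptions, not measurements.
    static final double SEEK_SECONDS = 0.010;            // assumed average disk seek time
    static final double TRANSFER_BYTES_PER_SEC = 100e6;  // assumed sequential throughput

    /** Time to update records in place, paying (at least) one seek per updated record. */
    static double randomUpdateSeconds(long updatedRecords) {
        return updatedRecords * SEEK_SECONDS;
    }

    /** Time to read the whole data set once and write a fresh copy, both sequentially. */
    static double fullRewriteSeconds(long totalBytes) {
        return 2.0 * totalBytes / TRANSFER_BYTES_PER_SEC;
    }
}
```

Under these assumptions, updating 1% of a 1 TB data set of 100-byte records (100 million records) in place takes about 1,000,000 seconds, or more than 11 days, while `fullRewriteSeconds(1_000_000_000_000L)` comes to about 20,000 seconds, or under 6 hours.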
  • 7. History ‣ 2000: Apache Lucene: batch index updates and sort/merge with on-disk index ‣ 2002: Apache Nutch: distributed, scalable open source web crawler; sort/merge optimization applies ‣ 2003-2004: Google publishes GFS and MapReduce papers ‣ 2006: Apache Hadoop: open source Java implementation of GFS and MR to solve Nutch’s problem; later becomes standalone project ‣ 2011: We’re here learning about it!
  • 8. Hadoop foundations ‣ Commodity hardware (3K - 7K $ machines) ‣ Only sequential reads / writes ‣ Distribution of data and processing across cluster ‣ Built in reliability / fault tolerance / redundancy ‣ Disk based, does not require data or indexes to fit in RAM ‣ Apache licensed, Open Source Software
  • 10. The US government builds their finger print search index using Hadoop.
  • 12. The content for the People You May Know feature is created by a chain of many MapReduce jobs that run daily. The jobs are reportedly a combination of graph traversal, clustering and assisted machine learning.
  • 14. Amazon’s Frequently Bought Together and Customers Who Bought This Item Also Bought features are brought to you by MapReduce jobs. Recommendation based on large sales transaction datasets is a common use case.
  • 16. Top Charts generated daily based on millions of users’ listening behavior.
  • 18. Top searches used for auto-completion are re-generated daily by a MapReduce job using all searches for the past couple of days. Popularity of search terms can be based on counts, but also on trends and correlation with other datasets (e.g. trending on social media, news, charts in case of music and movies, best seller lists, etc.)
  • 20. Hadoop Filesystem Friso van Vollenhoven fvanvollenhoven@xebia.com HDFS
  • 21. HDFS overview ‣ Distributed filesystem ‣ Consists of a single master node and multiple (many) data nodes ‣ Files are split up into blocks (typically 64MB) ‣ Blocks are spread across data nodes in the cluster ‣ Each block is replicated multiple times to different data nodes in the cluster (typically 3 times) ‣ Master node keeps track of which blocks belong to a file
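The block and replica bookkeeping described above can be sketched in a few lines. This is a toy model, not HDFS code: the class and method names are hypothetical, and the round-robin placement is an assumption chosen for simplicity (real HDFS placement is more involved).

```java
import java.util.ArrayList;
import java.util.List;

class BlockLayout {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB, the typical size named above
    static final int REPLICATION = 3;                 // typical replication factor named above

    /** Number of blocks needed for a file; the last block may be partially filled. */
    static long blockCount(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    /**
     * Assigns each block to REPLICATION distinct data nodes (numbered from 0).
     * Round-robin is a hypothetical policy used only for illustration.
     */
    static List<List<Integer>> placeReplicas(long fileBytes, int dataNodes) {
        if (dataNodes < REPLICATION) {
            throw new IllegalArgumentException("need at least " + REPLICATION + " data nodes");
        }
        List<List<Integer>> placement = new ArrayList<>();
        long blocks = blockCount(fileBytes);
        for (long block = 0; block < blocks; block++) {
            List<Integer> nodes = new ArrayList<>();
            for (int replica = 0; replica < REPLICATION; replica++) {
                nodes.add((int) ((block + replica) % dataNodes));
            }
            placement.add(nodes);
        }
        return placement;
    }
}
```

A 200 MB file needs 4 blocks, and with 3 replicas each the cluster stores 12 block copies in total. The master node only has to remember the resulting list of block locations per file.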
  • 22. HDFS interaction ‣ Accessible through Java API ‣ FUSE (filesystem in user space) driver available to mount as regular FS ‣ C API available ‣ Basic command line tools in Hadoop distribution ‣ Web interface
  • 23. HDFS interaction ‣ File creation, directory listing and other meta data actions go through the master node (e.g. ls, du, fsck, create file) ‣ Data goes directly to and from data nodes (read, write, append) ‣ Local read path optimization: clients located on same machine as data node will always access local replica when possible
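  • 24. Hadoop FileSystem (HDFS) [diagram: an HDFS client creates a file through the Name Node, which holds paths such as /some/file and /foo/bar; the client writes data to and reads data from the Data Nodes directly; Data Nodes replicate blocks to each other's disks; a node-local HDFS client reads data from the Data Node on its own machine]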
  • 25. HDFS daemons: NameNode ‣ Filesystem master node ‣ Keeps track of directories, files and block locations ‣ Assigns blocks to data nodes ‣ Keeps track of live nodes (through heartbeats) ‣ Initiates re-replication in case of data node loss ‣ Block meta data is held in memory • Will run out of memory when too many files exist ‣ Is a SINGLE POINT OF FAILURE in the system • Some solutions exist
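To see why the number of files matters more to the NameNode than the number of bytes, a rough estimate helps. The sketch below assumes about 150 bytes of NameNode memory per file or block object; that figure is a commonly quoted rule of thumb and an assumption here, not something stated on the slides.

```java
class NameNodeHeapEstimate {
    static final long BYTES_PER_OBJECT = 150;          // assumed rule of thumb, not from the slides
    static final long BLOCK_SIZE = 64L * 1024 * 1024;  // typical block size

    /** Rough in-memory metadata size: one object per file plus one per block. */
    static long estimateBytes(long fileCount, long avgFileBytes) {
        long blocksPerFile = Math.max(1, (avgFileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE);
        long objects = fileCount * (1 + blocksPerFile);
        return objects * BYTES_PER_OBJECT;
    }
}
```

Under this assumption, 1 TiB stored as 1,048,576 files of 1 MiB costs roughly 300 MB of metadata, while the same 1 TiB stored as 1,024 files of 1 GiB costs under 3 MB.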
  • 26. HDFS daemons: DataNode ‣ Filesystem worker node / “Block server” ‣ Uses underlying regular FS for storage (e.g. ext3) • Takes care of distribution of blocks across disks • Don’t use RAID • More disks means more IO throughput ‣ Sends heartbeats to NameNode ‣ Reports blocks to NameNode (on startup) ‣ Does not know about the rest of the cluster (shared nothing)
  • 27. Things to know about HDFS ‣ HDFS is write once, read many • But has append support in newer versions ‣ Has built in compression at the block level ‣ Does end-to-end checksumming on all data ‣ Has tools for parallelized copying of large amounts of data to other HDFS clusters (distcp) ‣ Provides a convenient file format to gather lots of small files into a single large one • Remember the NameNode running out of memory with too many files? ‣ HDFS is best used for large, unstructured volumes of raw data in BIG files used for batch operations • Optimized for sequential reads, not random access
  • 28. Hadoop Sequence Files ‣ Special type of file to store Key-Value pairs ‣ Stores keys and values as byte arrays ‣ Uses length encoded bytes as format ‣ Often used as input or output format for MapReduce jobs ‣ Has built in compression on values
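The idea of length-encoded key-value bytes can be illustrated with plain JDK streams. This is a simplified sketch of the general technique, not the actual SequenceFile on-disk format (which also has a header, sync markers and optional compression); the class name is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LengthPrefixedRecords {
    /** Writes each pair as: key length, key bytes, value length, value bytes. */
    static byte[] write(List<String[]> pairs) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (String[] pair : pairs) {
                byte[] key = pair[0].getBytes(StandardCharsets.UTF_8);
                byte[] value = pair[1].getBytes(StandardCharsets.UTF_8);
                out.writeInt(key.length);
                out.write(key);
                out.writeInt(value.length);
                out.write(value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reads the pairs back sequentially; no seeking or index is needed. */
    static List<String[]> read(byte[] data) {
        List<String[]> pairs = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                byte[] key = new byte[in.readInt()];
                in.readFully(key);
                byte[] value = new byte[in.readInt()];
                in.readFully(value);
                pairs.add(new String[] {
                    new String(key, StandardCharsets.UTF_8),
                    new String(value, StandardCharsets.UTF_8)
                });
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pairs;
    }
}
```

Because every field carries its own length, a reader can stream through the file front to back, which fits the sequential access pattern HDFS is optimized for.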
  • 29. Example: command line directory listing
        friso@fvv:~/java$ hadoop fs -ls /
        Found 3 items
        drwxr-xr-x   - friso supergroup          0 2011-03-31 17:06 /Users
        drwxr-xr-x   - friso supergroup          0 2011-03-16 14:16 /hbase
        drwxr-xr-x   - friso supergroup          0 2011-04-18 11:33 /user
        friso@fvv:~/java$
  • 31. Example: copy local file to HDFS friso@fvv:~/Downloads$ hadoop fs -put ./some-tweets.json tweets-data.json
  • 32. MapReduce Friso van Vollenhoven fvanvollenhoven@xebia.com Massively parallelizable computing
  • 33. MapReduce, the algorithm [figure: input data and required output]
  • 34. Map: extract something useful from each record [figure: each record passes through map and yields a KEY and a VALUE]
        void map(recordNumber, record) {
          key = record.findColorfulShape();
          value = record.findGrayShapes();
          emit(key, value);
        }
  • 35. Framework sorts all KeyValue pairs by Key [figure: KEYS / VALUES before and after sorting]
  • 36. Reduce: process values for each key [figure: sorted KEYS / VALUES pass through reduce and yield output KEYS / VALUES]
        void reduce(key, values) {
          allGrayShapes = [];
          foreach (value in values) {
            allGrayShapes.push(value);
          }
          emit(key, allGrayShapes);
        }
  • 37. MapReduce, the algorithm [figure: records flow through many parallel map steps, are sorted by key, and flow through several parallel reduce steps]
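The three phases above (map, sort by key, reduce) can be tried out in memory with plain Java. This is a single-process sketch of the algorithm only, not Hadoop's API; the record format `"key: value"` and all names are assumptions made for the example, and records are assumed to be well formed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MiniMapReduce {
    /** Map: extract a (key, value) pair from one record of the form "key: value". */
    static String[] map(String record) {
        String[] parts = record.split(":", 2);
        return new String[] { parts[0].trim(), parts[1].trim() };
    }

    /** Reduce: collect all values that were emitted for one key. */
    static List<String> reduce(String key, List<String> values) {
        return new ArrayList<>(values);
    }

    static Map<String, List<String>> run(List<String> records) {
        // Stand-in for the framework's sort: a TreeMap keeps keys ordered
        // and groups the values that belong to the same key.
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (String record : records) {
            String[] keyValue = map(record);
            grouped.computeIfAbsent(keyValue[0], k -> new ArrayList<>()).add(keyValue[1]);
        }
        // Call reduce once per unique key, in key order.
        Map<String, List<String>> output = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            output.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return output;
    }
}
```

Calling `MiniMapReduce.run` on the records `"star: circle"`, `"heart: square"` and `"star: triangle"` groups both `star` values under one key. In Hadoop the same map and reduce calls run on different machines, with the sort in between done by the framework.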
  • 38. Hadoop MapReduce: parallelized on top of HDFS ‣ Job input comes from files on HDFS • Typically sequence files • Other formats are possible; requires specialized InputFormat implementation • Built in support for text files (convenient for logs, csv, etc.) • Files must be splittable for parallelization to work - Not all compression formats have this property (e.g. gzip)
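The effect of splittability on parallelism can be shown with a small calculation. The sketch assumes one input split (and thus one mapper) per 64 MB block, and decides splittability from the file extension alone; both are simplifications, and the names are hypothetical.

```java
class InputSplits {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // assumed: one split per block

    /** Assumption for illustration: only gzip files are treated as non-splittable. */
    static boolean isSplittable(String fileName) {
        return !fileName.endsWith(".gz");
    }

    /** Number of map tasks that can work on this file in parallel. */
    static long splitCount(String fileName, long fileBytes) {
        if (fileBytes == 0) {
            return 0;
        }
        if (!isSplittable(fileName)) {
            return 1; // a single mapper has to read the whole file
        }
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }
}
```

Under these assumptions a 1 GiB text file can be processed by 16 mappers at once, while the same data as one gzip file is handled by a single mapper.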
  • 39. MapReduce daemons: JobTracker ‣ MapReduce master node ‣ Takes care of scheduling and job submission ‣ Splits jobs into tasks (Mappers and Reducers) ‣ Assigns tasks to worker nodes ‣ Reassigns tasks in case of failure ‣ Keeps track of job progress ‣ Keeps track of worker nodes through heartbeats
  • 40. MapReduce daemons: TaskTracker ‣ MapReduce worker process ‣ Starts Mappers and Reducers assigned by JobTracker ‣ Sends heartbeats to the JobTracker ‣ Sends task progress to the JobTracker ‣ Does not know about the rest of the cluster (shared nothing)
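Heartbeat tracking, as done by the master for its workers, boils down to remembering when each worker was last heard from. The sketch below is a generic illustration with hypothetical names and a caller-chosen timeout; it is not JobTracker code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called whenever a worker reports in. */
    void heartbeat(String worker, long nowMillis) {
        lastSeen.put(worker, nowMillis);
    }

    /** Workers that have been silent for longer than the timeout, sorted by name. */
    List<String> deadWorkers(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                dead.add(entry.getKey());
            }
        }
        Collections.sort(dead);
        return dead;
    }
}
```

A master would run `deadWorkers` periodically and reassign the tasks of any worker it returns. The workers themselves need no knowledge of each other, which matches the shared-nothing design.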
  • 42. Hadoop MapReduce: Mapper side ‣ Each mapper processes a piece of the total input • Typically blocks that reside on the same machine as the mapper (local datanode) ‣ Mappers sort output by key and store it on the local disk • If the mapper output does not fit in RAM, an on-disk merge sort happens
  • 43. Hadoop MapReduce: Reducer side ‣ Reducers collect sorted input KeyValue pairs over the network from Mappers • Reducer performs an (on-disk) merge on inputs from different mappers ‣ Reducer calls the reduce method for each unique key • List of values for each key is read from local disk (the result of the merge) • Values do not need to fit in RAM - Reduce methods that need a global view need enough RAM to fit all values for a key ‣ Reducer writes output KeyValue pairs to HDFS • Typically blocks go to local data node
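The merge of already sorted mapper outputs can be sketched as a k-way merge with a priority queue. The example works on in-memory lists of keys for brevity, whereas Hadoop merges files on disk; class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class SortedRunMerger {
    /** Read position within one sorted run (for example, one mapper's output). */
    private static final class Cursor {
        final List<String> run;
        int index;

        Cursor(List<String> run) {
            this.run = run;
        }

        String current() {
            return run.get(index);
        }
    }

    /** Merges sorted runs into one sorted list, reading each run strictly front to back. */
    static List<String> merge(List<List<String>> sortedRuns) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> a.current().compareTo(b.current()));
        for (List<String> run : sortedRuns) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();   // run whose next key is the smallest
            merged.add(smallest.current());
            smallest.index++;
            if (smallest.index < smallest.run.size()) {
                heap.add(smallest);          // re-queue with its next key
            }
        }
        return merged;
    }
}
```

Only the current element of each run has to be held in memory at any time, which is why neither the mapper output nor the list of values per key needs to fit in RAM.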
  • 45. <PLUG> Summer Classes Big data crunching using Hadoop and other NoSQL tools • Write Hadoop MapReduce jobs in Java • Run on an actual cluster pre-loaded with several datasets • Create a simple application or visualization with the result • Learn about Hadoop without the hassle of building a production cluster first • Have lots of fun! Dates: July 12, August 10 Only € 295,= for a full day course http://www.xebia.com/summerclasses/bigdata </PLUG>