Hadoop, HDFS and MapReduce
1. Hadoop and MapReduce
Friso van Vollenhoven
fvanvollenhoven@xebia.com
The workings of the elephant
2. Data everywhere
‣ Global data volume grows exponentially
‣ Information retrieval is BIG business these days
‣ Need means of economically storing and processing large data sets
5. Problems with existing solutions
‣ Databases are seek heavy; B-tree gives log(n) random accesses per update
‣ Seeks are wasted time, nothing of value happens during seeks
‣ Databases do not play well with commodity hardware (SANs and 16-CPU
machines are not in the price sweet spot of performance / $)
‣ Databases were not built with horizontal scaling in mind
6. Solution: sort/merge vs. updating the B-tree
‣ Eliminate the seeks, only sequential reading / writing
‣ Work with batches for efficiency
‣ Parallelize work load
‣ Distribute processing and storage
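The sort/merge idea can be sketched in a few lines of Python. This is a minimal illustration, not Hadoop code: the function name `merge_batch` is made up for the example, in-memory lists stand in for sorted files on disk, and the batch of updates is assumed to be small enough to sort in memory.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_batch(existing, updates):
    """Merge a batch of updates into an already sorted dataset.

    Both inputs are iterables of (key, value) pairs. `existing` must be
    sorted by key; `updates` is sorted here first. Each input is then read
    once, front to back, so no random seeks are needed.
    """
    sorted_updates = sorted(updates, key=itemgetter(0))
    # Tag records so that, for equal keys, an update sorts after the old record.
    old = ((k, 0, v) for k, v in existing)
    new = ((k, 1, v) for k, v in sorted_updates)
    merged = heapq.merge(old, new, key=lambda r: (r[0], r[1]))
    for key, group in groupby(merged, key=itemgetter(0)):
        *_, last = group  # the newest record for this key wins
        yield key, last[2]
```

Instead of one B-tree update (and several seeks) per record, the whole batch is applied in a single sequential pass over the data.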
7. History
‣ 2000: Apache Lucene: batch index updates and sort/merge with on disk index
‣ 2002: Apache Nutch: distributed, scalable open source web crawler; sort/merge
optimization applies
‣ 2003-2004: Google publishes the GFS (2003) and MapReduce (2004) papers
‣ 2006: Apache Hadoop: open source Java implementation of GFS and MR to solve
Nutch's problem; later becomes a standalone project
‣ 2011: We’re here learning about it!
8. Hadoop foundations
‣ Commodity hardware ($3K - $7K machines)
‣ Only sequential reads / writes
‣ Distribution of data and processing across cluster
‣ Built in reliability / fault tolerance / redundancy
‣ Disk based, does not require data or indexes to fit in RAM
‣ Apache licensed, Open Source Software
12. The content of the People You May Know feature is
created by a chain of many MapReduce jobs that
run daily. The jobs are reportedly a combination of
graph traversal, clustering and assisted machine
learning.
14. Amazon’s Frequently Bought Together and Customers Who Bought This Item Also
Bought features are brought to you by MapReduce jobs. Recommendations
based on large sales transaction datasets are a common use case.
18. Top searches used for auto-completion are re-generated daily by a
MapReduce job using all searches for the past couple of days.
Popularity for search terms can be based on counts, but also trending
and correlation with other datasets (e.g. trending on social media,
news, charts in case of music and movies, best seller lists, etc.)
20. Hadoop Filesystem
HDFS
21. HDFS overview
‣ Distributed filesystem
‣ Consists of a single master node and multiple (many) data nodes
‣ Files are split up into blocks (typically 64 MB)
‣ Blocks are spread across data nodes in the cluster
‣ Each block is replicated multiple times to different data nodes in the cluster
(typically 3 times)
‣ Master node keeps track of which blocks belong to a file
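To make the block layout concrete, here is a minimal sketch of how a file could be cut into blocks and how replicas could be assigned to data nodes. The function names are hypothetical, and the round-robin placement is an assumption made for the example; the actual placement policy is not described in these slides.

```python
from itertools import cycle, islice

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical block size
REPLICATION = 3                # typical replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for each block of a file of `file_size` bytes."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct data nodes, round-robin.

    Round-robin is an assumption of this sketch, not the real HDFS policy.
    """
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the replication factor")
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(data_nodes)
        placement[block_id] = list(
            islice(cycle(data_nodes), start, start + replication))
    return placement
```

A 200 MB file yields three full 64 MB blocks plus one 8 MB block. The resulting block-to-node mapping is the kind of bookkeeping the master node has to maintain.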
22. HDFS interaction
‣ Accessible through Java API
‣ FUSE (filesystem in user space) driver available to mount as regular FS
‣ C API available
‣ Basic command line tools in Hadoop distribution
‣ Web interface
23. HDFS interaction
‣ File creation, directory listing and other meta data actions go through the master
node (e.g. ls, du, fsck, create file)
‣ Data goes directly to and from data nodes (read, write, append)
‣ Local read path optimization: clients located on same machine as data node will
always access local replica when possible
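The local read path optimization amounts to a simple preference when choosing which replica to read. The sketch below assumes the client has already obtained the list of hosts holding a block from the master node; `choose_replica` is a hypothetical name, and falling back to the first listed replica is an arbitrary choice for the example.

```python
def choose_replica(replica_hosts, client_host):
    """Pick the data node to read a block from.

    Prefers a replica on the client's own machine; otherwise falls back to
    the first replica in the list (an arbitrary choice for this sketch).
    """
    if not replica_hosts:
        raise ValueError("block has no known replicas")
    if client_host in replica_hosts:
        return client_host
    return replica_hosts[0]
```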
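24. Hadoop FileSystem (HDFS)
[Diagram: an HDFS client sends "create file" to the Name Node, which holds the file names (/some/file, /foo/bar), and exchanges "read data" / "write data" directly with the Data Nodes. Three Data Nodes, each with several disks, "replicate" blocks among themselves. A node-local HDFS client reads data from the Data Node on its own machine.]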
25. HDFS daemons: NameNode
‣ Filesystem master node
‣ Keeps track of directories, files and block locations
‣ Assigns blocks to data nodes
‣ Keeps track of live nodes (through heartbeats)
‣ Initiates re-replication in case of data node loss
‣ Block meta data is held in memory
• Will run out of memory when too many files exist
‣ Is a SINGLE POINT OF FAILURE in the system
• Some solutions exist
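The heartbeat and re-replication bookkeeping can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the NameNode implementation: the class name, the method names and the 30 second timeout are all invented for the example.

```python
class NameNodeSketch:
    """Toy model of liveness tracking and under-replication detection."""

    def __init__(self, replication=3, timeout=30.0):
        self.replication = replication
        self.timeout = timeout     # assumed: seconds of silence before a node counts as dead
        self.last_heartbeat = {}   # data node -> time of its last heartbeat
        self.block_locations = {}  # block id -> set of data nodes holding a replica

    def heartbeat(self, node, now):
        self.last_heartbeat[node] = now

    def add_replica(self, block_id, node):
        self.block_locations.setdefault(block_id, set()).add(node)

    def live_nodes(self, now):
        return {node for node, seen in self.last_heartbeat.items()
                if now - seen <= self.timeout}

    def under_replicated(self, now):
        """Return {block_id: number of replicas missing}, counting live nodes only."""
        live = self.live_nodes(now)
        missing = {}
        for block_id, nodes in self.block_locations.items():
            shortfall = self.replication - len(nodes & live)
            if shortfall > 0:
                missing[block_id] = shortfall
        return missing
```

Everything in this model lives in memory, which hints at why the number of files and blocks is bounded by the master node's RAM.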
26. HDFS daemons: DataNode
‣ Filesystem worker node / “Block server”
‣ Uses underlying regular FS for storage (e.g. ext3)
• Takes care of distribution of blocks across disks
• Don’t use RAID
• More disks means more IO throughput
‣ Sends heartbeats to NameNode
‣ Reports blocks to NameNode (on startup)
‣ Does not know about the rest of the cluster (shared nothing)
27. Things to know about HDFS
‣ HDFS is write once, read many
• But has append support in newer versions
‣ Has built in compression at the block level
‣ Does end-to-end checksumming on all data
‣ Has tools for parallelized copying of large amounts of data to other HDFS
clusters (distcp)
‣ Provides a convenient file format to gather lots of small files into a single large
one
• Remember the NameNode running out of memory with too many files?
‣ HDFS is best used for large, unstructured volumes of raw data in BIG files used
for batch operations
• Optimized for sequential reads, not random access
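The checksumming point can be illustrated as follows: checksums are computed per fixed-size chunk when data is written and checked again when it is read, so corruption is detected and localized. The chunk size of 512 bytes and the use of CRC32 are assumptions for this sketch, and the function names are hypothetical.

```python
import zlib

CHUNK = 512  # bytes covered by one checksum; an assumed value for this sketch


def checksums(data, chunk=CHUNK):
    """Compute one CRC32 per `chunk` bytes of `data`."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def corrupt_chunks(data, expected, chunk=CHUNK):
    """Return the indexes of chunks whose checksum does not match `expected`."""
    actual = checksums(data, chunk)
    if len(actual) != len(expected):
        raise ValueError("data length does not match the stored checksums")
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
```

A reader that finds a corrupt chunk in one replica can fall back to another replica of the same block.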
28. Hadoop Sequence Files
‣ Special type of file to store Key-Value pairs
‣ Stores keys and values as byte arrays
‣ Uses length-prefixed bytes as its format
‣ Often used as input or output format for MapReduce jobs
‣ Has built in compression on values
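The idea of length-encoded key-value records can be shown with a toy format. The layout below (a 4-byte big-endian length before each key and each value) is an assumption chosen for readability; it is not the actual SequenceFile layout, which also carries a header, sync markers and optional compression.

```python
import io
import struct

_LEN = struct.Struct(">I")  # 4-byte big-endian length prefix (assumed layout)


def write_record(stream, key, value):
    """Append one record: length-prefixed key bytes, then length-prefixed value bytes."""
    for part in (key, value):
        stream.write(_LEN.pack(len(part)))
        stream.write(part)


def _read_exact(stream, n):
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("truncated record")
    return data


def read_records(stream):
    """Yield (key, value) byte pairs until the stream is exhausted."""
    while True:
        header = stream.read(_LEN.size)
        if not header:
            return  # clean end of file
        if len(header) != _LEN.size:
            raise EOFError("truncated record")
        (key_len,) = _LEN.unpack(header)
        key = _read_exact(stream, key_len)
        (value_len,) = _LEN.unpack(_read_exact(stream, _LEN.size))
        yield key, _read_exact(stream, value_len)
```

Because every field announces its own length, keys and values can be arbitrary byte arrays and the file can be read back strictly sequentially.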
38. Hadoop MapReduce: parallelized on top of HDFS
‣ Job input comes from files on HDFS
• Typically sequence files
• Other formats are possible; requires specialized InputFormat implementation
• Built in support for text files (convenient for logs, csv, etc.)
• Files must be splittable for parallelization to work
- Not all compression formats are splittable (e.g. gzip is not)
39. MapReduce daemons: JobTracker
‣ MapReduce master node
‣ Takes care of scheduling and job submission
‣ Splits jobs into tasks (Mappers and Reducers)
‣ Assigns tasks to worker nodes
‣ Reassigns tasks in case of failure
‣ Keeps track of job progress
‣ Keeps track of worker nodes through heartbeats
40. MapReduce daemons: TaskTracker
‣ MapReduce worker process
‣ Starts Mappers and Reducers assigned by JobTracker
‣ Sends heartbeats to the JobTracker
‣ Sends task progress to the JobTracker
‣ Does not know about the rest of the cluster (shared nothing)
42. Hadoop MapReduce: Mapper side
‣ Each mapper processes a piece of the total input
• Typically blocks that reside on the same machine as the mapper (local
datanode)
‣ Mappers sort output by key and store it on the local disk
• If the mapper output does not fit in RAM, on disk merge sort happens
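The mapper-side sort can be sketched as an external merge sort: buffer output up to a memory limit, sort and "spill" each full buffer as a run, then merge the sorted runs. The name `sort_with_spills` and the `max_in_memory` parameter are invented for this example, and the runs are kept in Python lists to keep it short, where a real mapper would write them to local disk.

```python
import heapq
from operator import itemgetter


def sort_with_spills(pairs, max_in_memory):
    """Sort (key, value) pairs by key while buffering at most `max_in_memory` pairs.

    Each full buffer is sorted and set aside as a run; the sorted runs are
    then merged lazily into one key-ordered stream.
    """
    if max_in_memory < 1:
        raise ValueError("max_in_memory must be at least 1")
    runs, buffer = [], []
    for pair in pairs:
        buffer.append(pair)
        if len(buffer) >= max_in_memory:
            runs.append(sorted(buffer, key=itemgetter(0)))  # "spill" a sorted run
            buffer = []
    if buffer:
        runs.append(sorted(buffer, key=itemgetter(0)))
    return heapq.merge(*runs, key=itemgetter(0))
```

The merge step only ever looks at the head of each run, so it reads every run sequentially, in line with the "only sequential reads / writes" foundation.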
43. Hadoop MapReduce: Reducer side
‣ Reducers collect sorted input KeyValue pairs over the network from Mappers
• Reducer performs (on disk) merge on inputs from different mappers
‣ Reducer calls the reduce method for each unique key
• List of values for each key is read from local disk (the result of the merge)
• Values do not need to fit in RAM
- Reduce methods that need a global view need enough RAM to fit all values
for a key
‣ Reducer writes output KeyValue pairs to HDFS
• Typically blocks go to local data node
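The reducer side can be sketched in the same style: merge the key-sorted outputs of several mappers, then call a reduce function once per unique key with a lazy stream of values. The names `run_reducer` and `count_values` are hypothetical, and plain lists stand in for data fetched over the network.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def run_reducer(mapper_outputs, reduce_fn):
    """Merge key-sorted mapper outputs and call `reduce_fn` once per unique key.

    `reduce_fn(key, values)` receives `values` as a lazy iterator and must
    consume it before returning; a function that only streams over the values
    never needs all values for a key in memory at once.
    """
    merged = heapq.merge(*mapper_outputs, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, reduce_fn(key, (value for _, value in group))


def count_values(key, values):
    """Example reduce function: sum the counts emitted for a key."""
    return sum(values)
```

With two mapper outputs `[("apple", 1), ("pear", 1)]` and `[("apple", 1), ("kiwi", 1), ("pear", 1)]`, `run_reducer` with `count_values` yields one total per key in key order. A reduce function that needs a global view (for example a median) would have to collect the values into a list first, which is where the RAM requirement comes from.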
45. <PLUG>
Summer Classes
Big data crunching using Hadoop and other NoSQL tools
• Write Hadoop MapReduce jobs in Java
• Run on an actual cluster pre-loaded with several datasets
• Create a simple application or visualization with the result
• Learn about Hadoop without the hassle of building a production cluster first
• Have lots of fun!
Dates: July 12, August 10
Only € 295,= for a full day course
http://www.xebia.com/summerclasses/bigdata
</PLUG>