An introduction to Hadoop for large scale data analysis

Hadoop – Large scale data analysis Abhijit Sharma Page 1 | 9/8/2011

Big Data Trends
- Unprecedented growth in data set size – Facebook 21+ PB data warehouse, 12+ TB/day
- Un(semi)-structured data – logs, documents, graphs
- Connected data – web, tags, graphs
- Relevant to enterprises – logs, social media, machine-generated data, breaking of silos

Putting Big Data to work
- Data-driven organization – decision support, new offerings
- Analytics on large data sets (FB Insights – Page, App etc. stats)
- Data mining – clustering, e.g. Google News articles
- Search – Google

Problem characteristics and examples
- Embarrassingly data-parallel problems
- Data chunked & distributed across cluster
- Parallel processing with data locality – task dispatched where data is
- Horizontal/linear scaling approach using commodity hardware
- Write once, read many
Examples
- Distributed logs – grep, # of accesses per URL
- Search – term vector generation, reverse links
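The chunk-and-process pattern can be illustrated on a single machine. The following is a minimal sketch, assuming an in-memory list of log lines and a thread pool standing in for cluster nodes; the names `chunk`, `grep_chunk`, `parallel_grep` and the sample log are hypothetical and not part of Hadoop.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(lines, size):
    """Split a list of lines into consecutive chunks of at most `size` lines."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def grep_chunk(lines, pattern):
    """Process one chunk independently: keep the lines that contain `pattern`."""
    return [line for line in lines if pattern in line]


def parallel_grep(lines, pattern, chunk_size=2, workers=4):
    """Grep every chunk in parallel, then concatenate the results in chunk order."""
    chunks = chunk(lines, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        parts = list(pool.map(lambda c: grep_chunk(c, pattern), chunks))
    return [line for part in parts for line in part]


# Hypothetical sample data standing in for a distributed log.
LOG = [
    "GET /index.html 200",
    "GET /about.html 200",
    "GET /index.html 304",
    "POST /login 302",
    "GET /index.html 200",
]

matches = parallel_grep(LOG, "/index.html")
print(len(matches))
```

No chunk needs data from any other chunk, which is what makes the problem embarrassingly parallel: adding workers only changes how many chunks are processed at once.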

What is Hadoop?
- Open source system for large-scale batch distributed computing on big data
- Map Reduce programming paradigm & framework
- Map Reduce infrastructure
- Distributed File System (HDFS)
- Endorsed/used extensively by web giants – Google, FB, Yahoo!

Map Reduce - Definition
- MapReduce is a programming model and an implementation for parallel processing of large data sets
- Map processes each logical record per input split to generate a set of intermediate key/value pairs
- Reduce merges all intermediate values associated with the same intermediate key
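The definition can be made concrete with a small in-memory model. This is only a sketch of the programming model, not of Hadoop itself: `map_reduce`, `url_mapper`, `count_reducer` and the sample log lines are assumed names and data, and everything runs in a single process.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to each record, group intermediate values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        # Map: one record yields zero or more intermediate (key, value) pairs.
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce: merge all intermediate values that share the same key.
    return {key: reducer(key, values) for key, values in groups.items()}


def url_mapper(line):
    """Emit (url, 1) for one access-log line of the assumed form 'METHOD URL'."""
    method, url = line.split()
    yield url, 1


def count_reducer(url, counts):
    """Sum the counts collected for one URL."""
    return sum(counts)


log_lines = ["GET /index.html", "GET /about.html", "GET /index.html"]
accesses = map_reduce(log_lines, url_mapper, count_reducer)
print(accesses)
```

The grouping step in the middle is the part a real framework performs across machines; the user supplies only the mapper and the reducer.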

Map Reduce - Functional Programming Origins
Map: apply a function to each list member – parallelizable
    [1, 2, 3].collect { it * it }
    Output: [1, 2, 3] -> Map (Square): [1, 4, 9]
Reduce: apply a function and an accumulator to each list member
    [1, 2, 3].inject(0) { sum, item -> sum + item }
    Output: [1, 2, 3] -> Reduce (Sum): 6
Map & Reduce
    [1, 2, 3].collect { it * it }.inject(0) { sum, item -> sum + item }
    Output: [1, 2, 3] -> Map (Square) -> [1, 4, 9] -> Reduce (Sum): 14
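For readers who do not know Groovy, the same three steps can be written in Python, where `map` plays the role of `collect` and `functools.reduce` plays the role of `inject`. The variable names are chosen here for illustration.

```python
from functools import reduce

numbers = [1, 2, 3]

# Map: apply a function to each list member independently.
squares = list(map(lambda x: x * x, numbers))

# Reduce: fold the list into one value with an accumulator that starts at 0.
total = reduce(lambda acc, item: acc + item, numbers, 0)

# Map & Reduce: square each member, then sum the squares.
sum_of_squares = reduce(lambda acc, item: acc + item,
                        map(lambda x: x * x, numbers), 0)

print(squares, total, sum_of_squares)  # [1, 4, 9] 6 14
```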

Word Count - Map Reduce

Word Count - Pseudo code
mapper (filename, file-contents):
    for each word in file-contents:
        emit (word, 1)    // single count for a word, e.g. ("the", 1) for each occurrence of "the"
reducer (word, values):    // values iterates over the list of counts for a word, e.g. ("the", [1, 1, ..])
    sum = 0
    for each value in values:
        sum = sum + value
    emit (word, sum)
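A runnable version of the pseudo code might look like the sketch below. It assumes whitespace-separated words and an in-memory dictionary of files; `run_word_count` is a hypothetical driver that stands in for the framework's grouping of intermediate pairs.

```python
from collections import defaultdict


def mapper(filename, contents):
    """Emit (word, 1) for each occurrence of a word in the file contents."""
    for word in contents.split():
        yield word, 1


def reducer(word, values):
    """Sum the list of counts collected for one word."""
    total = 0
    for value in values:
        total += value
    return word, total


def run_word_count(files):
    """Hypothetical driver: run the mapper per file, group by word, then reduce."""
    intermediate = defaultdict(list)
    for filename, contents in files.items():
        for word, count in mapper(filename, contents):
            intermediate[word].append(count)
    return dict(reducer(word, counts) for word, counts in intermediate.items())


# Made-up input files for illustration.
files = {"a.txt": "the quick fox", "b.txt": "the lazy dog the end"}
counts = run_word_count(files)
print(counts["the"])  # 3
```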

Examples – Map Reduce Definition
Word Count / Distributed logs search for # accesses to various URLs
- Map – emits (word/URL, 1) for each doc/log split
- Reduce – sums up the counts for a specific word/URL
Term Vector generation – term -> [doc-id]
- Map – emits (term, doc-id) for each doc split
- Reduce – Identity Reducer – accumulates the (term, [doc-id, doc-id ..])
Reverse Links – source -> target to target -> source
- Map – emits (target, source) for each doc split
- Reduce – Identity Reducer – accumulates the (target, [source, source ..])
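The two identity-reducer examples can be sketched together, because both reduce to "emit pairs, then group by key". The code below assumes tiny in-memory inputs; `group_by_key`, `term_mapper`, `link_mapper` and the sample documents and links are illustrative names and data, not Hadoop APIs.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Identity reduce: accumulate every value emitted for a key into a list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def term_mapper(doc_id, text):
    """Emit (term, doc-id) once for each distinct term in a document."""
    for term in sorted(set(text.split())):
        yield term, doc_id


def link_mapper(source, targets):
    """Emit (target, source) for each outgoing link of a page."""
    for target in targets:
        yield target, source


# Made-up inputs for illustration.
docs = {"d1": "hadoop stores data", "d2": "hadoop processes data"}
links = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}

term_index = group_by_key(
    pair for doc_id, text in docs.items() for pair in term_mapper(doc_id, text)
)
reverse_links = group_by_key(
    pair for source, targets in links.items() for pair in link_mapper(source, targets)
)

print(term_index["hadoop"])     # ['d1', 'd2']
print(reverse_links["c.html"])  # ['a.html', 'b.html']
```

In both cases the mapper does the real work by choosing which field becomes the key; the reducer only collects what arrives under that key.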

Map Reduce – Hadoop Implementation
- Hides complexity of distributed computing
- Automatic parallelization of job
- Automatic data chunking & distribution (via HDFS)
- Data locality – MR task dispatched where data is
- Fault tolerant to server, storage, network failures
- Network and disk transfer optimization
- Load balancing

Hadoop Map Reduce Architecture

HDFS Characteristics
- Very large files – block size 64 MB/128 MB
- Data access pattern – write once, read many
- Writes are large, create & append only
- Reads are large & streaming
- Commodity hardware
- Tolerant to failure – server, storage, network
- Highly available through transparent replication
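The block sizes above translate into simple arithmetic. The sketch below assumes a replication factor of 3, which is the usual HDFS default but is not stated on the slide; `block_count` and `raw_storage` are hypothetical helper names.

```python
MB = 1024 * 1024


def block_count(file_size, block_size):
    """Number of blocks needed for a file; the last block may be only partly filled."""
    return -(-file_size // block_size)  # ceiling division on integers


def raw_storage(file_size, replication=3):
    """Bytes consumed across the cluster when every block is stored `replication` times."""
    return file_size * replication


one_gb = 1024 * MB
print(block_count(one_gb, 64 * MB))     # 16
print(block_count(one_gb, 128 * MB))    # 8
print(block_count(200 * MB, 64 * MB))   # 4 (three full blocks plus one 8 MB block)
print(raw_storage(200 * MB) // MB)      # 600
```

Larger blocks mean fewer blocks per file, and therefore less per-block metadata to track for very large files.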

HDFS Architecture

Thanks

Backup Slides

Map & Reduce Functions

Job Configuration

Hadoop Map Reduce Components
Job Tracker
- Tracks MR jobs
- Runs on master node
Task Tracker
- Runs on data nodes and tracks Mapper, Reducer tasks assigned to the node
- Heartbeats to Job Tracker
- Maintains and picks up tasks from a queue
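The interaction between the two components can be pictured with a toy model. This is a deliberately simplified sketch: `ToyJobTracker`, its `heartbeat` method and the task names are invented for illustration, and it ignores data locality, task failures and the ordering of map and reduce phases that a real scheduler has to handle.

```python
from collections import deque


class ToyJobTracker:
    """Holds a queue of pending tasks and hands them out when trackers report in."""

    def __init__(self, tasks):
        self.pending = deque(tasks)
        self.assignments = {}  # task name -> task tracker that received it

    def heartbeat(self, tracker, free_slots):
        """A task tracker reports its free slots; reply with the tasks assigned to it."""
        assigned = []
        while free_slots > 0 and self.pending:
            task = self.pending.popleft()
            self.assignments[task] = tracker
            assigned.append(task)
            free_slots -= 1
        return assigned


job_tracker = ToyJobTracker(["map-0", "map-1", "map-2", "reduce-0"])
first = job_tracker.heartbeat("tt1", free_slots=2)
second = job_tracker.heartbeat("tt2", free_slots=1)
third = job_tracker.heartbeat("tt1", free_slots=2)
fourth = job_tracker.heartbeat("tt2", free_slots=1)
print(first, second, third, fourth)
```

Work is pulled rather than pushed: a task tracker receives tasks only when it reports free capacity.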

HDFS
Name Node
- Manages the file system namespace and regulates access to files by clients – stores metadata
- Mapping of blocks to Data Nodes and replicas
- Manages replication
- Executes file system namespace operations like opening, closing, and renaming files and directories
Data Node
- One per node, which manages local storage attached to the node
- Internally, a file is split into one or more blocks and these blocks are stored in a set of Data Nodes
- Responsible for serving read and write requests from the file system's clients
- Performs block creation, deletion, and replication upon instruction from the Name Node
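The Name Node's block-to-Data-Node mapping can be modelled as a small dictionary. The sketch below is a toy under stated assumptions: `ToyNameNode` is an invented class, placement is plain round-robin (real HDFS placement is rack-aware), and re-replication simply picks the first live node that does not already hold the block.

```python
class ToyNameNode:
    """Keeps metadata only: which data nodes hold each block of each file."""

    def __init__(self, data_nodes, replication=2):
        self.data_nodes = list(data_nodes)
        self.replication = replication
        self.block_map = {}  # (path, block index) -> list of data node names
        self._next = 0       # round-robin cursor (assumed placement policy)

    def add_file(self, path, num_blocks):
        """Record replica locations for every block of a new file."""
        for index in range(num_blocks):
            replicas = [
                self.data_nodes[(self._next + r) % len(self.data_nodes)]
                for r in range(self.replication)
            ]
            self.block_map[(path, index)] = replicas
            self._next += 1

    def locate(self, path):
        """Return the replica locations of a file's blocks, in block order."""
        keys = sorted(key for key in self.block_map if key[0] == path)
        return [self.block_map[key] for key in keys]

    def remove_node(self, node):
        """Handle a failed data node: re-replicate its blocks onto other live nodes.

        Assumes at least `replication` live nodes remain after the failure.
        """
        self.data_nodes.remove(node)
        for replicas in self.block_map.values():
            if node in replicas:
                replicas.remove(node)
                spare = next(n for n in self.data_nodes if n not in replicas)
                replicas.append(spare)


name_node = ToyNameNode(["dn1", "dn2", "dn3"], replication=2)
name_node.add_file("/logs/a", num_blocks=3)
before = [list(replicas) for replicas in name_node.locate("/logs/a")]
name_node.remove_node("dn2")
after = name_node.locate("/logs/a")
print(before)
print(after)
```

The Name Node never stores file data in this model, only the map; the blocks themselves live on the Data Nodes.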
