Hadoop: Distributed data processing
Amr Awadallah
Founder/CTO, Cloudera, Inc.
ACM Data Mining SIG
Thursday, January 25th, 2010

Wednesday, January 27, 2010

Outline

▪ Scaling for Large Data Processing
▪ What is Hadoop?
▪ HDFS and MapReduce
▪ Hadoop Ecosystem
▪ Hadoop vs RDBMSes
▪ Conclusion

Current Storage Systems Can’t Compute

[Diagram: data flows from Instrumentation through Collection into a Storage Farm for Unstructured Data (20TB/day, Mostly Append), then through an ETL Grid into an RDBMS (200GB/day) that serves Interactive Apps. Annotations: "Filer heads are a bottleneck"; "Ad hoc Queries & Data Mining" is marked "Non-Consumption".]


The Solution: A Store-Compute Grid

[Diagram: data flows from Instrumentation through Collection into a Storage + Computation grid (Mostly Append). The grid runs ETL and Aggregations that feed an RDBMS serving Interactive Apps, and it also serves "Batch" Apps: Ad hoc Queries & Data Mining.]


What is Hadoop?

▪ A scalable fault-tolerant grid operating system for data storage and processing
▪ Its scalability comes from the marriage of:
  ▪ HDFS: Self-Healing High-Bandwidth Clustered Storage
  ▪ MapReduce: Fault-Tolerant Distributed Processing
▪ Operates on unstructured and structured data
▪ A large and active ecosystem (many developers and additions like HBase, Hive, Pig, …)
▪ Open source under the friendly Apache License
▪ http://wiki.apache.org/hadoop/


Hadoop History

▪ 2002-2004: Doug Cutting and Mike Cafarella started working on Nutch
▪ 2003-2004: Google publishes GFS and MapReduce papers
▪ 2004: Cutting adds DFS & MapReduce support to Nutch
▪ 2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
▪ 2007: NY Times converts 4TB of archives over 100 EC2 instances
▪ 2008: Web-scale deployments at Y!, Facebook, Last.fm
▪ April 2008: Yahoo does fastest sort of a TB, 3.5 minutes over 910 nodes
▪ May 2009:
  ▪ Yahoo does fastest sort of a TB, 62 seconds over 1460 nodes
  ▪ Yahoo sorts a PB in 16.25 hours over 3658 nodes
▪ June 2009, Oct 2009: Hadoop Summit (750), Hadoop World (500)
▪ September 2009: Doug Cutting joins Cloudera

Hadoop Design Axioms

1. System Shall Manage and Heal Itself
2. Performance Shall Scale Linearly
3. Compute Should Move to Data
4. Simple Core, Modular and Extensible


HDFS: Hadoop Distributed File System
Block Size = 64MB
Replication Factor = 3

Cost/GB is a few ¢/month vs $/month
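
As a rough illustration of what the block size and replication factor imply, the sketch below computes how many blocks a file occupies and how much raw disk it consumes across the cluster. Only the 64MB block size and the replication factor of 3 come from the slide; the helper name `hdfs_footprint` and the 1GB example file are assumptions made for illustration, and metadata overhead is ignored.

```python
import math

BLOCK_SIZE_MB = 64        # block size from the slide
REPLICATION_FACTOR = 3    # replication factor from the slide


def hdfs_footprint(file_size_mb):
    """Return (block count, raw MB stored across the cluster) for one file.

    Hypothetical helper for illustration only; it ignores metadata overhead.
    """
    # A file is cut into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every block is stored REPLICATION_FACTOR times on different nodes.
    raw_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, raw_mb


blocks, raw_mb = hdfs_footprint(1024)  # assumed 1GB example file
print(blocks, raw_mb)
```

Under these assumptions a 1GB file is split into 16 blocks and takes 3072MB of raw disk.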

MapReduce: Distributed Processing


MapReduce Example for Word Count
SELECT word, COUNT(1) FROM docs GROUP BY word;
cat *.txt | mapper.pl | sort | reducer.pl > out.txt

[Diagram: the input is divided into Split 1 … Split i … Split N. Each split supplies (docid, text) records to one of Map 1 … Map M, which emit (words, counts). The Shuffle delivers (sorted words, counts) to Reduce 1 … Reduce R, and each reducer writes an Output File 1 … R containing (sorted words, sum of counts). Example: for the text “To Be Or Not To Be?”, the mappers emit "Be, 5", "Be, 12", "Be, 7" and "Be, 6", and one reducer outputs "Be, 30".]
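
The map, shuffle and reduce stages of the word count can be sketched in a single process as follows. This is a minimal illustration, not Hadoop's API: the function names are hypothetical, everything runs in memory on one machine, and the mapper emits a count of 1 per word rather than the partial per-split counts shown on the slide.

```python
from collections import defaultdict
import re


def map_words(docid, text):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in re.findall(r"[a-z']+", text.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all counts by word and return the groups sorted by word."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return sorted(groups.items())


def reduce_counts(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(docs):
    """Run map, shuffle and reduce over a {docid: text} dictionary."""
    pairs = [pair for docid, text in docs.items() for pair in map_words(docid, text)]
    return [reduce_counts(word, counts) for word, counts in shuffle(pairs)]


docs = {"doc1": "To Be Or Not To Be?"}  # sample text from the slide
result = word_count(docs)
print(result)
```

For the sample sentence this prints `[('be', 2), ('not', 1), ('or', 1), ('to', 2)]`. The three functions play the same roles as `mapper.pl`, `sort` and `reducer.pl` in the shell pipeline.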


Hadoop High-Level Architecture

▪ Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs
▪ Name Node: Maintains mapping of file blocks to data node slaves
▪ Job Tracker: Schedules jobs across task tracker slaves
▪ Data Node: Stores and serves blocks of data
▪ Task Tracker: Runs tasks (work units) within a job
▪ Data Node and Task Tracker share a physical node
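
To make the division of roles concrete, here is a toy model of the two master roles. It is a sketch under stated assumptions, not Hadoop's actual scheduler: the block names, node names, slot counts and the `schedule` function are all hypothetical. The block map stands in for the Name Node's mapping of blocks to data nodes, and the scheduler stands in for the Job Tracker by placing each task on a node that already stores the block it reads.

```python
# Toy model; all names and data below are hypothetical.

# Name Node role: which data nodes hold a replica of each file block.
block_locations = {
    "block-1": ["node-a", "node-b", "node-c"],
    "block-2": ["node-b", "node-c", "node-d"],
    "block-3": ["node-a", "node-c", "node-d"],
}


def schedule(block_locations, free_slots):
    """Job Tracker role: assign one task per block to a node holding that block.

    free_slots maps node name -> number of tasks its task tracker can still run.
    Returns {block: node}; a block maps to None when no replica holder has a
    free slot (this sketch does not fall back to a remote node).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block, holders in sorted(block_locations.items()):
        candidates = [node for node in holders if slots.get(node, 0) > 0]
        if candidates:
            # Prefer the replica holder with the most free slots.
            node = max(candidates, key=lambda n: slots[n])
            slots[node] -= 1
            assignment[block] = node
        else:
            assignment[block] = None
    return assignment


plan = schedule(block_locations, {"node-a": 1, "node-b": 1, "node-c": 1, "node-d": 1})
print(plan)
```

With one free slot per node, every task lands on a node that already holds its block, which is the reason the Data Node and Task Tracker share a physical node.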


Apache Hadoop Ecosystem

[Diagram: a layered stack, listed here from the bottom up.]

▪ HDFS (Hadoop Distributed File System)
▪ MapReduce (Job Scheduling/Execution System)
▪ HBase (key-value store)
▪ Streaming/Pipes APIs
▪ Pig (Data Flow), Hive (SQL), Sqoop
▪ ETL Tools, BI Reporting, RDBMS
▪ Alongside the stack: ZooKeeper (Coordination), Avro (Serialization)


Use The Right Tool For The Right Job

Hadoop: When to use?
• Affordable Storage/Compute
• Structured or Not (Agility)
• Resilient Auto Scalability

Relational Databases: When to use?
• Interactive Reporting (<1sec)
• Multistep Transactions
• Interoperability

Economics of Hadoop

▪ Typical Hardware:
  ▪ Two Quad Core Nehalems
  ▪ 24GB RAM
  ▪ 12 * 1TB SATA disks (JBOD mode, no need for RAID)
  ▪ 1 Gigabit Ethernet card
▪ Cost/node: $5K/node
▪ Effective HDFS Space:
  ▪ ¼ reserved for temp shuffle space, which leaves 9TB/node
  ▪ 3 way replication leads to 3TB effective HDFS space/node
  ▪ But assuming 7x compression that becomes ~ 20TB/node
▪ Effective Cost per user TB: $250/TB
▪ Other solutions cost in the range of $5K to $100K per user TB
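
The cost-per-TB figure can be reproduced step by step. The sketch below uses only the numbers quoted on the slides (12 disks of 1TB, ¼ shuffle reserve, 3 way replication, 7x compression, $5K per node); the function name `cost_per_user_tb` and its parameter names are hypothetical.

```python
def cost_per_user_tb(
    disks=12,              # 12 * 1TB SATA disks per node
    disk_tb=1.0,
    shuffle_reserve=0.25,  # 1/4 reserved for temp shuffle space
    replication=3,         # 3 way replication
    compression=7,         # assumed 7x compression, as on the slide
    node_cost=5000,        # $5K per node
):
    """Return (effective user TB per node, dollars per user TB)."""
    raw_tb = disks * disk_tb                    # 12TB raw
    usable_tb = raw_tb * (1 - shuffle_reserve)  # 9TB after the shuffle reserve
    replicated_tb = usable_tb / replication     # 3TB after replication
    user_tb = replicated_tb * compression       # 21TB after compression
    return user_tb, node_cost / user_tb


user_tb, dollars_per_tb = cost_per_user_tb()
print(user_tb, round(dollars_per_tb))
```

The exact arithmetic gives 21TB per node and about $238 per user TB; rounding the capacity down to ~20TB, as the slide does, gives the quoted $250/TB.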

Sample Talks from Hadoop World ‘09
▪ VISA: Large Scale Transaction Analysis
▪ JP Morgan Chase: Data Processing for Financial Services
▪ China Mobile: Data Mining Platform for Telecom Industry
▪ Rackspace: Cross Data Center Log Processing
▪ Booz Allen Hamilton: Protein Alignment using Hadoop
▪ eHarmony: Matchmaking in the Hadoop Cloud
▪ General Sentiment: Understanding Natural Language
▪ Yahoo!: Social Graph Analysis
▪ Visible Technologies: Real-Time Business Intelligence
▪ Facebook: Rethinking the Data Warehouse with Hadoop and Hive

Slides and Videos at http://www.cloudera.com/hadoop-world-nyc

Cloudera Desktop


Conclusion

Hadoop is a data grid operating system which provides an economically scalable solution for storing and processing large amounts of unstructured or structured data over long periods of time.

Contact Information

Amr Awadallah
CTO, Cloudera Inc.
aaa@cloudera.com
http://twitter.com/awadallah

Online Training Videos and Info:
http://cloudera.com/hadoop-training
http://cloudera.com/blog
http://twitter.com/cloudera

