2. Agenda
• What is Hadoop ?
• History of Hadoop
• Hadoop Components
• Hadoop Ecosystem
• Customer Use Cases
• Hadoop Challenges
© Copyright 2011 EMC Corporation. All rights reserved. 2
4. What is Hadoop?
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using a simple programming model. It is designed to scale up from single
servers to thousands of machines, each offering local computation and
storage. Rather than rely on hardware to deliver high-availability, the library
itself is designed to detect and handle failures at the application layer, so
delivering a highly-available service on top of a cluster of computers, each of
which may be prone to failures.
• Concepts :
– NameNode (aka Master) is responsible for managing the
namespace repository (index) for the filesystem, and
managing Jobs
– DataNode (aka Segment) is responsible for storing blocks of
data and running tasks
– MapReduce – Push compute to where data resides
5. What is Hadoop and Where did it start?
• Created by Doug Cutting
– HDFS (storage) & MapReduce (compute)
– Inspired by Google’s MapReduce and Google File
System (GFS) papers
• It is now a top-level Apache project backed by a large
open-source development community
• Three major subprojects
– Nutch
– Lucene
– Hadoop
6. What makes Hadoop Different?
• Hadoop is a complete paradigm shift
• Bypasses 25 years of enterprise ceilings
• Hadoop also defers some really difficult challenges:
– Non-transactional
– File System is essentially read-only
• “Greater potential for the Hadoop architecture to mature
and handle the complexity of transactions than for RDBMSs
to figure out failures and data growth”
7. Confluence of Factors
• Hadoop makes analytics on large scale data sets
more pragmatic
– BI Solutions often suffer from garbage-in, garbage-out
– Opens up new ways of understanding and thus running lines of business
• Classic Architectures won’t scale any further
• New sources of information (social media) are too
big and unwieldy for traditional solutions
– 5-year enterprise data growth is estimated at 650%, with over 80% of
that unstructured (e.g., Facebook collects 100 TB per day)
• What works for Google & Facebook!
8. Hadoop vs Relational Solutions
• Hadoop is a paradigm shift in the way we think about and manage data
• Traditional solutions were not designed with growth in mind
• Big-Data accelerates this problem dramatically
Category         Traditional RDBMS                      Hadoop
Scalability      Resource constrained;                  Linear expansion;
                 re-architecture required to grow;      seamless addition & subtraction of nodes;
                 ~5 TB                                  ~5 PB
Fault            Afterthought; many critical            Designed in; tasks are
Tolerance        points of failure                      automatically restarted
Problem          Transactional, OLTP;                   Batch, OLAP (today!);
Space            inability to incorporate new sources   no bounds
10. History
• Google paper: MapReduce: Simplified Data Processing on Large Clusters – 2004
– GFS & MapReduce framework
• Top Level Apache Open Source Community Project -
2008
• Yahoo, Facebook, and Powerset become the main contributors, with
Yahoo running over 10K nodes (300K cores) - 2009
• Hadoop cluster at Yahoo sets Terasort benchmark standard – Jul 08
– 209 seconds to sort 1 TB (62 seconds now)
– Cluster config:
• 910 nodes – 4 dual-core Xeons @ 2.0 GHz, 8 GB RAM, 4 SATA disks
• 1 Gb Ethernet
• 40 nodes per rack
• 8 Gb Ethernet uplinks from each rack
• RHEL Server 5.1
• Sun JDK 1.6.0_05-b13
12. Analytics Component Design
[Stack diagram: analytics packages (R, Mahout; Python, Java, streaming) layered over MapReduce, HBase, Hive, and Pig, all running on HDFS over JBOD disks]
13. Hadoop Components
Two Core Components
HDFS MapReduce
Storage Compute
• Storage & Compute in One Framework
• Open Source Project of the Apache Software Foundation
• Written in Java
15. HDFS Concepts
• Sits on top of native (ext3, xfs, etc.) file system
• Performs best with a ‘modest’ number of large files
– Millions, rather than billions, of files
– Each file typically 100 MB or more
• Files in HDFS are ‘write once’
– No random writes to files are allowed
– Append support is available in Hadoop 0.21
• HDFS is optimized for large, streaming reads of files
– Rather than random reads
16. HDFS
• Hadoop Distributed File System
– Data is organized into files & directories
– Files are divided into blocks, typically 64-128MB each,
and distributed across cluster nodes
– Block placement is known at runtime by map-reduce so
computation can be co-located with data
– Blocks are replicated (default is 3 copies) to handle
failure
– Checksums are used to ensure data integrity
• Replication is the one and only strategy for error
handling, recovery and fault tolerance
– Make multiple copies and be happy!
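The block math and replica placement described above can be sketched in a few lines of Python (a toy model for illustration; `place_replicas` is an invented round-robin stand-in, not HDFS's actual rack-aware placement policy):

```python
# Sketch: split a file into fixed-size blocks and give each block
# 3 replicas on distinct nodes, round-robin (toy placement only).
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS block size

def num_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed to store a file (last block may be partial)."""
    return (file_size + block_size - 1) // block_size

def place_replicas(block_id, nodes, replication=3):
    """Toy placement: choose `replication` distinct nodes round-robin."""
    return [nodes[(block_id + i) % len(nodes)] for i in range(replication)]

blocks_10tb = num_blocks(10 * 1024**4)  # blocks for a 10 TB file
print(blocks_10tb)  # 81920

nodes = ["node%d" % i for i in range(10)]
print(place_replicas(0, nodes))  # ['node0', 'node1', 'node2']
```

At 128 MB per block, a 10 TB input yields 81,920 blocks, which is where figures like "82K maps" for a 10 TB job come from.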
18. HDFS Components
• NameNode
• DataNode
• Standby NameNode
• Job Tracker
• Task Tracker
19. NameNode
• Provides a centralized repository for the
namespace
– An index of which files are stored in which blocks
• Responds to client requests (MapReduce jobs) by
coordinating the distribution of tasks
• In memory only
– Hadoop 0.23 provides a distributed (federated) NameNode
– NameNode recovery must rebuild the entire metadata
repository
20. Hadoop Architecture - HDFS
• Block-level storage
• N-node replication
• NameNode for
– File system index (EditLog)
– Access coordination
– IPC via TCP/IP
• DataNode for
– Data block management
– Job execution (MapReduce)
• Automated fault tolerance
21. Job Tracker
• MapReduce jobs are controlled by a software daemon
known as the JobTracker
• The JobTracker resides on a single node
– Clients submit MapReduce jobs to the JobTracker
– The JobTracker assigns Map and Reduce tasks to other
nodes on the cluster
– These nodes each run a software daemon known as the
TaskTracker
– The TaskTracker is responsible for actually instantiating the
Map or Reduce task, and reporting progress back to the
JobTracker
• A Job consists of a collection of Map & Reduce Tasks
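The JobTracker's "push compute to where data resides" behavior can be sketched as a toy scheduler (a heavily simplified assumption: real assignment happens via TaskTracker heartbeats and also weighs rack locality, slot counts, and more):

```python
# Toy task assignment: give each map task a node that holds a replica
# of its input block when one is idle, otherwise any idle node.
def assign_tasks(tasks, idle_nodes):
    """tasks: {task_id: set of nodes holding its block}. Returns {task_id: node}."""
    idle = set(idle_nodes)
    assignment = {}
    for task, replica_nodes in tasks.items():
        local = replica_nodes & idle
        node = min(local) if local else min(idle)  # min() just for determinism
        assignment[task] = node
        idle.discard(node)  # one task per node in this sketch
    return assignment

tasks = {"m0": {"node1", "node2"}, "m1": {"node3"}, "m2": {"node1"}}
print(assign_tasks(tasks, ["node1", "node2", "node3", "node4"]))
# {'m0': 'node1', 'm1': 'node3', 'm2': 'node2'}  (m2's data node is taken)
```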
23. Map Reduce Framework
• Map step
– Input records are parsed into
intermediate key/value pairs
– Multiple Maps per Node
• 10TB => 128MB/Blk => 82K Maps
• Reduce step
– Each Reducer handles all like
keys
– 3 Steps
• Shuffle: All like keys are retrieved
from each Mapper
• Sort: Intermediate keys are sorted
prior to reduce
• Reduce: Values are processed
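The Map, Shuffle, and Reduce steps above can be shown end-to-end with an in-process word count (a sketch of the programming model, not the Hadoop Java API):

```python
from collections import defaultdict

def map_step(record):
    # Parse an input record into intermediate (key, value) pairs.
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    # Group all values for a like key together (done by the framework).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    # Each reducer handles all values for a like key.
    return (key, sum(values))

records = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = [pair for r in records for pair in map_step(r)]
result = dict(reduce_step(k, v) for k, v in shuffle(intermediate).items())
print(result["the"], result["fox"])  # 3 2
```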
25. Reduce Task
• After the Map phase is over, all the intermediate values for a
given intermediate key are combined together into a list
• This list is given to a Reducer
– There may be a single Reducer, or multiple Reducers
– This is specified as part of the job configuration (see later)
– All values associated with a particular intermediate key are
guaranteed to go to the same Reducer
– The intermediate keys, and their value lists, are passed to the
Reducer in sorted key order
– This step is known as the ‘shuffle and sort’
• The Reducer outputs zero or more final key/value pairs
– These are written to HDFS
• In practice, the Reducer usually emits a single key/value pair
for each input key
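Because the framework delivers each key's values grouped and in sorted key order, a reducer can be a single pass over its input. The sketch below mimics a Hadoop-Streaming-style reducer reading key-sorted, tab-separated lines (the line format is an assumption of this example):

```python
from itertools import groupby

def streaming_reduce(sorted_lines):
    # Input lines are already sorted by key, so all values for one key
    # are adjacent and groupby() can collect them in one pass.
    out = []
    rows = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(rows, key=lambda kv: kv[0]):
        out.append((key, sum(int(v) for _, v in group)))
    return out

lines = ["ant\t1", "ant\t1", "bee\t1", "cat\t1", "cat\t1", "cat\t1"]
print(streaming_reduce(lines))  # [('ant', 2), ('bee', 1), ('cat', 3)]
```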
26. Fault Tolerance
• Hadoop will only allocate tasks to active nodes
• Map-Reduce can compensate for slow running jobs
– If a Mapper appears to be running significantly more slowly than the
others, a new instance of the Mapper will be started on another
machine, operating on the same data
– The results of the first Mapper to finish will be used
– Hadoop will kill off the Mapper which is still running
• Yahoo experiences multiple failures (> 10) of various
components (drives, cables, servers) every day
– Which have exactly 0 impact on operations
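Speculative execution as described above can be illustrated with two redundant attempts where the first finisher wins (a toy model; Hadoop's real straggler-detection heuristics are more involved, and the delays here are invented):

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def attempt(name, delay):
    # Stand-in for a map task; must be idempotent for this to be safe.
    time.sleep(delay)
    return name

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {pool.submit(attempt, "fast-node", 0.01),
               pool.submit(attempt, "slow-node", 0.5)}
    done, not_done = wait(futures, return_when=FIRST_COMPLETED)
    winner = done.pop().result()   # use the first result to arrive
    for f in not_done:
        f.cancel()                 # "kill off" the straggling attempt

print(winner)  # fast-node
```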
29. Ecosystem Distribution by Role
[Diagram: Hadoop ecosystem offerings grouped by role: distribution (Apache), reporting, analytics, monitoring, manageability, training, consulting, IDE (hadoop-ide), data integration, data visualization, UAP]
30. Hadoop Components (hadoop.apache.org)
• HDFS – Hadoop Distributed File System
• MapReduce – Framework for writing scalable data applications
• Pig – Procedural language that abstracts lower-level MapReduce
• ZooKeeper – Highly reliable distributed coordination
• Hive – System for querying and managing structured data built on top of
HDFS (SQL-like queries)
• HBase – Database for random, real-time read/write access
• Oozie – Workflow/coordination service to manage jobs
• Mahout – Scalable machine learning libraries
31. Technology Adoption Lifecycle
[Diagram: adoption curve showing Innovators/Early Adopters (Hadoop is here today), Early Majority, Late Majority, Laggards]
33. HBase Overview
• HBase is a sparse, distributed, persistent,
scalable, reliable multi-dimensional map which is
indexed by row key
– Hadoop Database, ~ “NoSQL” database
– Many relational features
– Scalable: region servers
– Multiple client access: Java, REST, Thrift
• What’s it good for?
– Queries against a number of rows that makes your
Oracle server puke!
• HBase leverages HDFS for its storage
34. HBase in Practice
• High performance, real-time query
• Client is typically a Java program
• But HBase supports many other APIs:
– JSON: JavaScript Object Notation
– REST: Representational State Transfer
– Thrift, Avro: serialization/RPC frameworks
35. HBase – Key/Value Store
• Excellent key-based access to a specific cell or
sequential cells of data
• Column-oriented architecture (like GPDB)
– Column families group related attributes that are often
queried together
– Members of a family are stored together
• Versioning of cells is used to provide update
capability
– A change to an existing cell is stored as a new version,
identified by timestamp
• No transactional guarantees
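The timestamp-versioning scheme above can be modeled in a few lines (an illustrative sketch; the class and method names are invented and are not HBase's client API):

```python
import itertools

class VersionedMap:
    """Toy HBase-style store: updates append versions, never overwrite."""
    def __init__(self):
        self._cells = {}                  # (row, column) -> [(ts, value), ...]
        self._clock = itertools.count(1)  # stand-in for real timestamps

    def put(self, row, column, value):
        # A change to an existing cell is stored as a new timestamped version.
        self._cells.setdefault((row, column), []).append((next(self._clock), value))

    def get(self, row, column):
        # Reads return the newest version (highest timestamp).
        versions = self._cells.get((row, column), [])
        return max(versions)[1] if versions else None

t = VersionedMap()
t.put("row1", "cf:status", "NEW")
t.put("row1", "cf:status", "SHIPPED")
print(t.get("row1", "cf:status"))  # SHIPPED
```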
36. Hive
• Data Warehousing package built on top of Hadoop
• System for managing and querying structured data
– Leverages MapReduce for execution
– Utilizes HDFS (or HBase) for storage
• Data is stored in tables
– Consists of a separate schema metastore and data files
• HiveQL is a SQL-like language
– Queries are converted into MapReduce jobs
37. Hive – Basics & Syntax
-- Hive example
-- Tell Hive to use a local (non-HDFS) repository for MapReduce
hive> SET mapred.job.tracker=local;
hive> SET mapred.local.dir=/Users/hardar/Documents/training/HDWorkshop/labs/9.hive/data;
hive> SET hive.exec.mode.local.auto=false;

-- Set up Hive storage locations in HDFS (if not using local mode)
$ hadoop fs -mkdir /tmp
$ hadoop fs -mkdir /user/hive/warehouse
$ hadoop fs -chmod g+w /tmp
$ hadoop fs -chmod g+w /user/hive/warehouse

-- Create an orders table and load data from the local file system
create table orders (orderid bigint, customerid bigint, productid int,
    qty int, rate int, estdlvdate string, status string)
    row format delimited fields terminated by ",";
load data local inpath '9.hive/data/orders.txt' into table orders;
select * from orders;

-- Create a products table and load data from the local file system
create table products (productid int, description string)
    row format delimited fields terminated by ",";
load data local inpath '9.hive/data/products.txt' into table products;
select * from products;
38. Pig
• Provides a mechanism for using MapReduce without
programming in Java
– Utilizes HDFS & MapReduce
• Allows for a more intuitive means to specify data
flows
– High-level sequential, data flow language
– Pig Latin
– Python integration
• Comfortable for researchers who are familiar with Perl &
Python
• Pig is easier to learn & execute, but more limited
in scope of functionality than Java
39. Pig – Basics & Syntax
-- file: demographic.pig
-- Extracts INCOME (in thousands) and ZIPCODE from census data;
-- filters out zero incomes

-- Define a relation and load it directly from a local file
grunt> DEMO_TABLE = LOAD 'data/input/demo_sample.txt' USING PigStorage(',')
    AS (gender:chararray, age:int, income:int, zip:chararray);

-- Describe DEMO_TABLE
grunt> describe DEMO_TABLE;

-- Run an MR job to dump DEMO_TABLE (like select * from ...)
grunt> dump DEMO_TABLE;

-- Store DEMO_TABLE in HDFS
grunt> store DEMO_TABLE into '/gphd/pig/DEMO_TABLE';
41. Mahout
• Important stuff first: most common pronunciation is “Ma-h-
out” – rhymes with ‘trout’
• Machine Learning Library that Runs on HDFS
• 4 Primary Use Cases:
– Recommendation Mining – People who like X, also like Y
– Clustering – Topic based association
– Classification – Assign new docs to existing categories
– Frequent Item set Mining – Which things will appear together
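Recommendation mining ("people who like X also like Y") reduces to counting item co-occurrence; here is a minimal sketch of that idea (not Mahout's implementation or API, and the grocery data is invented):

```python
from collections import Counter

def recommend(item, baskets, top_n=2):
    # Count how often every other item appears alongside `item`,
    # then return the most frequent co-occurring items.
    co = Counter()
    for basket in baskets:
        if item in basket:
            co.update(b for b in basket if b != item)
    return [other for other, _ in co.most_common(top_n)]

baskets = [{"milk", "bread", "eggs"},
           {"milk", "bread"},
           {"milk", "beer"}]
# "bread" co-occurs with "milk" twice, so it ranks first.
print(recommend("milk", baskets))
```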
42. Revolution Analytics R
• Statistical programming language for Hadoop
– Open Source & Revolution R Enterprise
– More than just counts and averages
• Ability to manipulate HDFS directly from R
• Mimics Java APIs
44. Hadoop Use Cases
• Internet
– Search index generation
– User engagement behavior
– Targeting / advertising optimizations
– Recommendations
• BioMed
– Computational biomedical systems
– Bioinformatics
– Data mining and genome analysis
• Financial
– Prediction models
– Fraud analysis
– Portfolio risk management
• Telecom
– Call data records
– Set-top & DVR streams
• Social
– Recommendations
– Network graphs
– Feed updates
• Enterprises
– Email analysis and image processing
– ETL
– Reporting & analytics
– Natural language processing
• Media/Newspapers
– Image conversions
• Agriculture
– Process “agri” streams
• Image
– Geo-spatial processing
• Education
– Systems research
– Statistical analysis of stuff on the web
45. Greenplum Hadoop Customers
How our customers are using Hadoop
• Return Path
– World’s leader in email certification & scoring
– Uses Hadoop & Hbase to store & process ISP data
– Replaced Cloudera with Greenplum MR
• American Express
– Early stages of developing Big Data Analytics strategy
– Greenplum MR selected over Cloudera
– Chose GP b/c of EMC Support & Existing Relationship
• SunGard
– IT company focusing on availability services
– Chose Greenplum MR as the platform for big-data-analytics-as-a-service
– Competes against AWS Elastic MapReduce
46. Major Telco: CDR Churn Analysis
• Business problem: Construct a churn model to provide early
detection of customers who are going to end their contracts
• Available data
– Dependent variable: did a customer leave in a 4-month period?
– Independent variables: various features on customer call history
– ~120,000 training data points, ~120,000 test data points
• First attempt
– Use R, specifically the Generalised Additive Models (GAM) package
– Quickly built a model that matched T-Mobile’s existing model
48. Hadoop Pain Points
• Integrated Product Suite
– No integrated Hadoop stack (Hadoop, Pig, Hive, HBase, ZooKeeper, Oozie, Mahout, ...)
• Interoperability
– No industry-standard ETL and BI stack integration (Informatica, MicroStrategy, Business Objects, ...)
• Monitoring
– Poor job and application monitoring solutions
– Non-existent performance monitoring
• Operability and Manageability
– Complex system configuration and manageability
– No data format interoperability & storage abstractions
• Performance
– Poor dimensional lookup performance
– Very poor random access and serving performance
49. Data Co-Processing
[Diagram: an analytic productivity layer (applications, tools, Chorus) sits on data computing interfaces (SQL, MapReduce, in-database analytics, parallel data loading, batch or real-time); beneath it, Greenplum Database (SQL DB engine) and Hadoop (MapReduce engine) each provide compute and storage, linked by parallel data exchange over the network]
• All data types: unstructured, structured, temporal, geospatial, sensor, and spatial data
50. Questions?
Thank you!