This document provides an overview of Hadoop storage from the perspectives of different stakeholders. The Hadoop application team prefers direct attached storage for performance reasons, because Hadoop was designed for affordable internet-scale analytics in which data locality matters. IT operations, however, has valid concerns about reliability, manageability, utilization, and integration with other systems when data sits on direct attached storage rather than shared storage. Both approaches involve tradeoffs that depend on the organization's infrastructure, workload characteristics, and priorities.
Traditional applications work on a model where data is loaded from wherever it is stored into memory on the computer where the application runs. As Google processed ever-increasing amounts of internet data, its engineers quickly realized that this centralized approach to computation was not sustainable, so they moved to a model that scales out both processing and storage: data is processed on the machine where it is stored. The processing technology became MapReduce, and the storage model became the Google File System (GFS), the direct ancestor of today's HDFS.

Hadoop is a top-level Apache project built and used by a global community of contributors. Yahoo has been the largest contributor to the project and uses Hadoop extensively across its businesses. One of its employees, Doug Cutting, reviewed key papers from Google and concluded that the technologies they described could solve the scalability problems of Nutch, an open source web search technology. Cutting then led the effort to develop Hadoop (which, incidentally, he named after his son's stuffed elephant).

Hadoop is particularly well suited to batch-oriented, read-intensive applications. Key features include the ability to distribute and manage data across a large number of nodes and disks. By using the MapReduce programming model with the Hadoop framework, programmers can create applications that automatically take advantage of parallel processing. A single commodity box with, say, one CPU and one disk forms a node in Hadoop. Such boxes can be combined into clusters, and new nodes can be added to a cluster without an administrator or programmer changing the format of the data, how the data is loaded, or how the jobs (programming logic) are written.

The following overview of Hadoop was extracted from the Hadoop wiki at http://wiki.apache.org/hadoop/: Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
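To make the Map/Reduce paradigm concrete, here is a minimal, self-contained Python sketch of the classic word-count pattern. Real Hadoop jobs are normally written in Java and run across a cluster; this simulates the map, shuffle, and reduce phases in a single process purely to illustrate the programming model (the sample input lines are made up for the example).

```python
# A minimal, self-contained sketch of the Map/Reduce word-count pattern.
# Real Hadoop jobs run each phase on many nodes in parallel; here the
# "map", "shuffle", and "reduce" phases are simulated in-process.
from collections import defaultdict

def map_phase(lines):
    """Map: split each input line into (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single result."""
    for key, values in grouped:
        yield (key, sum(values))

if __name__ == "__main__":
    sample = ["hadoop stores data on compute nodes",
              "hadoop moves computation to the data"]
    for word, count in sorted(reduce_phase(shuffle_phase(map_phase(sample)))):
        print(word, count)
```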
http://wikibon.org/wiki/v/Big_Data:_Hadoop%2C_Business_Analytics_and_Beyond

A Hadoop "stack" is made up of a number of components. They include:
- Hadoop Distributed File System (HDFS): The default storage layer in any given Hadoop cluster.
- Name Node: The node in a Hadoop cluster that tells clients where in the cluster particular data is stored and whether any nodes have failed.
- Secondary Node: A backup to the Name Node; it periodically replicates and stores data from the Name Node in case the Name Node fails.
- Job Tracker: The node in a Hadoop cluster that initiates and coordinates MapReduce jobs, i.e., the processing of the data.
- Slave Nodes: The grunts of any Hadoop cluster, slave nodes store data and take direction from the Job Tracker on how to process it.

In addition to the above, the Hadoop ecosystem is made up of a number of complementary sub-projects. NoSQL data stores such as Cassandra and HBase are also used to store the results of MapReduce jobs in Hadoop. In addition to Java, some MapReduce jobs and other Hadoop functions are written in Pig, an open source language designed specifically for Hadoop. Hive is an open source data warehouse originally developed by Facebook that allows for analytic modeling within Hadoop.

Following is a guide to Hadoop's components:
- Hadoop Distributed File System (HDFS): The storage layer of Hadoop, a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.
- MapReduce: A software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts: the "Map" function divides a query into multiple parts and processes data at the node level, and the "Reduce" function aggregates the results of the "Map" function to determine the "answer" to the query.
- Hive: A Hadoop-based data warehouse developed by Facebook. It allows users to write queries in SQL, which are then converted to MapReduce. This lets SQL programmers with no MapReduce experience use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.
- Pig: Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).
- HBase: A non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts, and deletes. eBay and Facebook use HBase heavily.
- Flume: A framework for populating Hadoop with data. Agents are deployed throughout one's IT infrastructure – inside web servers, application servers, and mobile devices, for example – to collect data and integrate it into Hadoop.
- Oozie: A workflow processing system that lets users define a series of jobs written in multiple languages – such as MapReduce, Pig, and Hive – and then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after the previous jobs on which it relies for data have completed.
- Whirr: A set of libraries that allows users to easily spin up Hadoop clusters on top of Amazon EC2, Rackspace, or any virtual infrastructure. It supports all major virtualized infrastructure vendors on the market.
- Avro: A data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls.
- Mahout: A data mining library. It takes the most popular data mining algorithms for performing clustering, regression testing, and statistical modeling and implements them using the MapReduce model.
- Sqoop: A connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside Hadoop and instruct Sqoop to move data from Oracle, Teradata, or other relational databases to that target.
- BigTop: An effort to create a more formal process and framework for packaging and interoperability testing of Hadoop's sub-projects and related components, with the goal of improving the Hadoop platform as a whole.
Direct attached storage: 5 cents/GB = $50/TB = $500 for 10 TB = $2,500 for 50 TB; x 3 Hadoop copies = $7,500 of storage in the Hadoop servers + $2,500 for the Hadoop server = $10,000.
External array: 26 cents/GB = $260/TB = $2,600 for 10 TB = $13,000 for 50 TB of storage in an external array, plus the cost of the SAN or NAS network and the cost of rebalancing the Hadoop cluster for equivalent performance.
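The comparison above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the slide's own assumptions (5 cents/GB for server-internal disk, 26 cents/GB for external array capacity, a $2,500 server, 50 TB of user data, 3-way Hadoop replication); it ignores the SAN/NAS network and rebalancing costs noted above.

```python
# Rough cost comparison from the slide: 50 TB of user data,
# 3-way Hadoop replication on local disk vs. a single copy on an external array.
USER_DATA_TB = 50
REPLICAS = 3
LOCAL_DISK_PER_TB = 50.0      # 5 cents/GB  = $50/TB (internal server disk)
ARRAY_DISK_PER_TB = 260.0     # 26 cents/GB = $260/TB (external array)
SERVER_COST = 2500.0          # the Hadoop server itself

local = USER_DATA_TB * REPLICAS * LOCAL_DISK_PER_TB + SERVER_COST
array = USER_DATA_TB * ARRAY_DISK_PER_TB   # + SAN/NAS network + rebalancing costs

print("Local (DAS) storage + server: $%.0f" % local)   # $10,000
print("External array storage only:  $%.0f" % array)   # $13,000
```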
There are two aspects of Hadoop that are important to understand. MapReduce is a software framework introduced by Google to support distributed computing on large data sets across clusters of computers. The Hadoop Distributed File System (HDFS) is where Hadoop stores its data. This file system spans all the nodes in a cluster; effectively, HDFS links together the data that resides on many local nodes, making the data part of one big file system. Furthermore, HDFS assumes nodes will fail, so it replicates a given chunk of data across multiple nodes to achieve reliability. The degree of replication can be customized by the Hadoop administrator or programmer; the default, however, is to replicate every chunk of data across three nodes: two on the same rack and one on a different rack. You can use other file systems with Hadoop (for example, GPFS), but HDFS is the most common. The key to understanding Hadoop lies in the MapReduce programming model. This is essentially a representation of the divide-and-conquer processing model: the input is split into many small pieces, the Hadoop nodes process those pieces in parallel (the map step), and once they have been processed, the results are distilled down to a single answer (the reduce step).
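The default placement described above (three copies of each chunk: two on one rack and one on another) can be illustrated with a small sketch. The cluster topology, node names, and selection logic below are hypothetical and simplified; this is not the actual HDFS placement code, which also considers the writer's node, available space, and load.

```python
import random

# Hypothetical cluster topology: rack name -> list of data nodes.
CLUSTER = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
}

def place_replicas(cluster, replication=3):
    """Mimic the default policy: two replicas on one rack, one on another."""
    assert replication == 3, "sketch only handles the default replication factor"
    first_rack, second_rack = random.sample(list(cluster), 2)
    two_local = random.sample(cluster[first_rack], 2)   # two copies share a rack
    one_remote = random.choice(cluster[second_rack])    # third copy on a different rack
    return two_local + [one_remote]

print(place_replicas(CLUSTER))   # e.g. ['node2', 'node1', 'node5']
```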
The filer is present just for completeness. It plays no part in the story we are telling about GPFS and GPFS-FPO as a replacement for HDFS. There is no need to talk to it, and if you wish to remove it from your personal copies of the deck, that is fine.
Policy-based ingest – move data into and out of the FPO pool using GPFS policies.
Summary: When you break it down even further, IBM has constructed a portfolio of software and solutions with the breadth and depth to meet the needs of all organizations today, combined with unique synergies across this portfolio. These synergies enable organizations to start with their most pressing needs, knowing that they will be able to leverage their skills and investment in future projects to reduce risk, lower costs, and achieve faster time to value in meeting the needs of the business. There are multiple "entry points", driven by your most pressing needs, that help you start moving down the path of an information-led transformation. (Note: describe this slide from the bottom up.)

When you think about an information-led transformation, you need to ensure that your infrastructure and systems are optimized to handle the various workloads demanded of them. Especially today, when you are faced with a glut of new information, you need to ensure that relevant information is available, that it is secure, and that you are able to retrieve it in a timely manner – not only for analytical, operational, and transactional systems, but also for regulatory compliance. That is why IBM Software Group and our Systems & Technology Group are working together to provide optimized solutions focused on delivering greater business value to our customers, faster, for increased return on investment. From the new IBM Smart Analytics System, to the new DB2 pureScale for continuous availability, unlimited capacity, and application transparency, to the deep integration of System z, IBM has unparalleled expertise in designing and implementing workload-optimized systems and services.

On top of that infrastructure, there is also the need to ensure that you can bring all of those sources of information together to create a single, trusted view of information from across your business – regardless of whether that information is structured or unstructured – and then manage it over time. From data warehousing, Master Data Management, and information integration to Agile ECM and integrated data management, IBM's InfoSphere portfolio ensures that organizations will be able to leverage their information over time to drive innovation across their business.

Armed with this single view of your business, you can then look to optimize business processes and drive greater performance across your organization. Decision makers will have the right information, at the right time, in the right context to make better, more informed decisions, and even to anticipate new opportunities or counter potential threats more effectively. The Business Analytics and Optimization platform supports an information-led transformation in that it focuses on establishing well-constructed processes and empowering individuals throughout the organization with pervasive, predictive, real-time analytics. With Cognos and the newly acquired SPSS portfolios, organizations can now be more proactive and predictive in innovating their business.
The IBM Big Data Platform extends the traditional warehouse in two ways:

Big Data in Motion – streaming data, such as securities data (stock tickers) or sensor data (temperature readings, heart rates, or the revolutions per second of a piece of machinery). This data can stream at a very high rate and vary greatly in its structure. Our product offering for Big Data in Motion is InfoSphere Streams, which is capable of performing analytics on the streaming data in real time.

Big Data at Rest – a set of data in static storage, for instance a large set of log files from a web site's click-stream analysis, or pools of raw text from service engagements with customers. Our product offering for Big Data at Rest is InfoSphere BigInsights, which is capable of performing analytics on this large set of varied data.

Streams and BigInsights interface with each other and can use existing data warehouses as data sources for their analytics; alternatively, the data warehouse can pull data from Streams and BigInsights.

Transcript: BIG Data's all about internet scale. In fact, I think these two ideas sort of end up being kind of synonymous. When everybody thinks of BIG Data, they think of internet data. They think of all the external data sources that can be pulled together. And we in fact are combining our capability in data warehousing together with this internet scale capability around Hadoop MapReduce. (Author's original notes: IBM IOD 2010, GS Day 2.)
IBM offers two basic models – the GSS24 and GSS26 – with 4 or 6 JBODs, respectively. These two basic configurations are scalable into larger storage solutions by using them as building blocks.
IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise. Apache Hadoop is the open source software framework used to reliably manage large volumes of structured and unstructured data. BigInsights enhances this technology to withstand the demands of your enterprise, adding administrative, workflow, provisioning, and security features, along with best-in-class analytical capabilities from IBM Research. The result is a more developer- and user-friendly solution for complex, large-scale analytics. InfoSphere BigInsights allows enterprises of all sizes to cost-effectively manage and analyze the massive volume, variety, and velocity of data that consumers and businesses create every day.

InfoSphere Streams: Part of IBM's platform for big data, IBM InfoSphere Streams allows you to capture and act on all of your business data – all of the time, just in time. InfoSphere Streams radically extends the state of the art in big data processing; it is a high-performance computing platform that allows user-developed applications to rapidly ingest, analyze, and correlate information as it arrives from thousands of real-time sources. Users are able to:
- Continuously analyze massive volumes of data at rates up to petabytes per day.
- Perform complex analytics of heterogeneous data types including text, images, audio, voice, VoIP, video, police scanners, web traffic, email, GPS data, financial transaction data, satellite data, sensors, and any other type of digital information that is relevant to your business.
- Leverage sub-millisecond latencies to react to events and trends as they are unfolding, while it is still possible to improve business outcomes.
- Adapt to rapidly changing data forms and types.
- Seamlessly deploy applications on any size computer cluster.
- Meet current reaction-time and scalability requirements with the flexibility to evolve with future changes in data volumes and business rules.
- Quickly develop new applications that can be mapped to a variety of hardware configurations and adapted to shifting priorities.
- Provide security and information confidentiality for shared information.
Learn more about how InfoSphere Streams aligns with any industry.
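InfoSphere Streams applications are typically written in its own Streams Processing Language and run on a cluster; the sketch below is only a generic Python illustration of the underlying idea of acting on data as it arrives, keeping a small sliding-window aggregate in memory instead of landing the data first. The sensor readings, window size, and alert threshold are invented for the example.

```python
from collections import deque

class SlidingWindowAverage:
    """Keep a rolling average over the last N readings of a live stream."""
    def __init__(self, size):
        self.window = deque(maxlen=size)

    def add(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)

# Simulated sensor feed: react as each reading arrives, without storing it first.
monitor = SlidingWindowAverage(size=5)
for reading in [70.1, 70.3, 71.0, 74.8, 79.5, 83.2]:
    avg = monitor.add(reading)
    if avg > 75.0:                       # illustrative alert threshold
        print("alert: rolling average %.1f exceeds threshold" % avg)
```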
Link to enter your email address and download a free copy of this book: https://www14.software.ibm.com/webapp/iwm/web/signup.do?source=sw-infomgt&S_PKG=500016891&S_CPM=is_bdebook1_biginsightsfp
Direct URL for the book (3.5 MB Acrobat Reader file): http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
InfoSphere BigInsights features Apache Hadoop as a core component. There are two releases of InfoSphere BigInsights: Basic and Enterprise.

Basic Edition – a free offering. It has open source components as well as IBM value-add (maintenance console, DB2 integration, integrated installation), and you can purchase support for it. This is an excellent choice for companies that want a Hadoop environment up and running or are conducting a POC – and it lays the foundation to turn that POC into a pilot or full enterprise deployment.

Enterprise Edition – adds significant value on the same base platform, so you can grow into it. There are two main value-adds: this offering hardens Hadoop, providing enterprise-quality stability, and it provides an analytics layer. Specifically, it includes a rock-solid file system alternative to what is included in open source Hadoop, text analytics, analytics visualization, security, an integrated web console, workflow and scheduling, indexing, and documentation.

Transcript: What we're doing is putting together a comprehensive solution around BIG Data, and if you're going to the sessions here you'll hear more about this in some of the breakouts, certainly in the expo demonstration capability. But we're looking at and starting to deliver scenarios that combine both non-traditional BIG Data types of information together with traditional data. And putting those together in creative ways to allow you to navigate and mine and understand the patterns across this very large corpus of information. And of course, as required, we're applying real-time stream processing into that flow because we see many of these scenarios demanding real-time analytics. (Author's original notes: IBM IOD 2010, GS Day 2.)