© 2013 IBM Corporation
Hadoop – It’s Not Just Internal Storage
John Sing, Executive Consultant
IBM Systems and Technology Group Session 1185A Tuesday, June 11, 2013
IBM Storage Solutions for Big Data
John Sing
 31 years of experience with IBM in high end servers, storage,
and software
– 2009 - Present: IBM Executive Strategy Consultant: IT Strategy and
Planning, Enterprise Large Scale Storage, Internet Scale Workloads
and Data Center Design, Big Data Analytics, HA/DR/BC
– 2002-2008: IBM IT Data Center Strategy, Large Scale Systems, Business
Continuity, HA/DR/BC, IBM Storage
– 1998-2001: IBM Storage Subsystems Group - Enterprise Storage Server Marketing Manager, Planner
for ESS Copy Services (FlashCopy, PPRC, XRC, Metro Mirror, Global Mirror)
– 1994-1998: IBM Hong Kong, IBM China Marketing Specialist for High-End Storage
– 1989-1994: IBM USA Systems Center Specialist for High-End S/390 processors
– 1982-1989: IBM USA Marketing Specialist for S/370, S/390 customers (including VSE and VSE/ESA)
 singj@us.ibm.com
 You may follow my daily IT research blog
– http://www.delicious.com/atsf_arizona
 You may follow me on Slideshare.net:
– http://www.slideshare.net/johnsing1
 My LinkedIn:
– http://www.linkedin.com/in/johnsing
Agenda
 Understanding today’s Hadoop environments
– Hadoop architecture, usage cases, deployments
– Hadoop design, performance, and cost considerations
 Differing Hadoop perspectives: Applications/Business Line vs. Operations
– Understanding implications of direct attached storage (DAS) vs. Shared Storage
 Intelligently choosing Hadoop storage solutions
– Usage cases where Direct Attached Storage makes sense
– Intelligent usage cases where Shared Storage makes sense
– Future evolution of storage, Hadoop, and the intersection of the two
 IBM Hadoop, Storage, Big Data hardware and software components, tools,
offerings
Understanding today’s Hadoop environments
Hadoop – It’s Not Just Internal Storage
What is Hadoop?
Instead of the traditional IT computation model:
 Which brings the data to the function/program on an application server
 Loads data into memory on an application server and processes it
 Unfortunately, this doesn’t scale for internet-scale Big Data problems
Apache Hadoop: open source framework for data-intensive applications
 Inspired by Google technologies (MapReduce, GFS)
 Well-suited to batch-oriented, read-intensive applications
 Yahoo! adopted these technologies and open sourced them into the Apache Hadoop project
Hadoop has become a pervasive enabler of internet-scale applications, working with thousands
of nodes and petabytes of data in a highly parallel, cost-effective manner
 CPU + disks of commodity storage = Hadoop “node”
 Hadoop nodes today running mission-critical production in massive clusters
 10s of thousands of servers
 New nodes can be added as needed to the cluster, without changing:
 Data formats, how data is loaded, how jobs are written
Tutorials: http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/
The World of Hadoop: worldwide usage
 eBay
 Linkedin
 Yahoo!
 Facebook
 New York Times
 Many, many more…
http://www.datanami.com/datanami/2012-04-26/six_super-scale_hadoop_deployments.html
One source for Hadoop users (but not the only one!): http://wiki.apache.org/hadoop/PoweredBy
Hadoop is today a well-developed ecosystem
 Hadoop – overall name of the software stack
 HDFS – Hadoop Distributed File System
 MapReduce – software compute framework (Map = queries; Reduce = aggregates answers)
 Hive – Hadoop-based data warehouse
 Pig – Hadoop-based language
 HBase – non-relational database for fast lookups
 Flume – populates Hadoop with data
 Oozie – workflow processing system
 Whirr – libraries to spin up Hadoop on Amazon EC2, Rackspace, etc.
 Avro – data serialization
 Mahout – data mining
 Sqoop – connectivity to non-Hadoop data stores
 BigTop – packaging / interop of all Hadoop components
http://wikibon.org/wiki/v/Big_Data:_Hadoop%2C_Business_Analytics_and_Beyond
http://blog.cloudera.com/blog/2013/01/apache-hadoop-in-2013-the-state-of-the-platform/
http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/
Hadoop vendor ecosystem today
http://datameer2.datameer.com/blog/wp-content/uploads/2012/06/hadoop_ecosystem_d3_photoshop.jpg http://www.forbes.com/special-report/2013/industry-atlas.html
Why understand the Hadoop stack and environment?
 Hadoop is being used for much more than just internet-scale Big Data analytics
 Hadoop is increasingly being used by enterprises for inexpensive data storage
– As an industry, we’re exploiting a much wider variety of data types
– With tools like Hadoop, it has become affordable to ingest, analyze, and keep available an internet-scale
“Big Landing Zone” Hadoop cluster for storing data
• data that was previously not viable to keep online
– The Hadoop cluster can then also run internet-scale analytics on this data
– A significant driver: moving to Hadoop to reduce traditional database licensing costs
 Storage industry dynamics:
– Today, JBOD storage in a server chassis might be as low as 4-6 cents/raw GB
• At these prices, adding 50TB usable to Hadoop cluster might only cost $10K in total including server
• Even at typical Hadoop 3X copies, this is still less initial cost than enterprise storage at 26 cents/GB
• Not saying this includes all factors, but these dynamics clearly affect the decision
– And then, there’s flash storage coming…..
Must understand full depth of the Hadoop environment and storage industry dynamics:
– In order to decide if/when/where Hadoop internal storage or shared storage is appropriate
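A quick back-of-envelope check of the storage-cost dynamics above. The class and method names are invented for this sketch, and the prices ($0.05 and $0.26 per raw GB) are the slides' illustrative figures, not quotes:

```java
public class HadoopStorageCost {
    // Raw capacity required for a given usable capacity at a replication factor
    static double rawGB(double usableGB, int replicas) {
        return usableGB * replicas;
    }

    // Media cost in dollars at a given price per raw GB
    static double mediaCost(double rawGB, double dollarsPerGB) {
        return rawGB * dollarsPerGB;
    }

    public static void main(String[] args) {
        // 50 TB usable at the Hadoop default of 3 copies, JBOD at $0.05/raw GB
        double raw = rawGB(50_000, 3);                          // 150,000 raw GB
        double jbod = mediaCost(raw, 0.05);                     // $7,500 in disks
        // The same 50 TB on enterprise storage at $0.26/GB (single copy)
        double enterprise = mediaCost(rawGB(50_000, 1), 0.26);  // $13,000
        System.out.printf("JBOD 3x: $%.0f, enterprise 1x: $%.0f%n", jbod, enterprise);
        // prints: JBOD 3x: $7500, enterprise 1x: $13000
    }
}
```

Even carrying three full copies, the commodity JBOD media lands well under the single-copy enterprise figure, which is why the slide's "$10K total including server" claim is plausible.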
Why Hadoop was created for Big Data
Traditional approach: move data to the program
– The application server and database server are separate
– Data can be on multiple servers; the analysis program can run on multiple application servers
– The network is still in the middle: the application server queries the data, the data is returned over
the network, and the data is processed on the application server before the result is sent to the user

Big Data approach: move the function/program to the data
– A master node sends the function to the data nodes; each data node queries and processes its local
data, and the consolidated result is sent back to the user
– The analysis program runs where the data is: on the data node
– Only the analysis program has to go through the network
– The analysis program needs to be MapReduce-aware
– Highly scalable: thousands of nodes, petabytes and more
Thank you to: Pascal VEZOLLE/France/IBM@IBMFR and Francois Gibello/France/IBM for the use of this slide
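The latency argument above is easy to quantify: shipping the dataset across the network dwarfs shipping the program. A small sketch of the arithmetic (the class name, payload sizes, and 10 Gb link speed are illustrative assumptions, not figures from the slides):

```java
public class MoveProgramNotData {
    // Seconds to push a payload (in MB) across a link (in Gb/s)
    static double transferSeconds(double payloadMB, double linkGbps) {
        return (payloadMB * 8.0) / (linkGbps * 1000.0);
    }

    public static void main(String[] args) {
        double shipData = transferSeconds(1_000_000, 10); // move a 1 TB dataset over 10 GbE
        double shipCode = transferSeconds(50, 10);        // move a 50 MB analysis job instead
        System.out.printf("ship data: %.0f s, ship program: %.3f s%n", shipData, shipCode);
        // prints: ship data: 800 s, ship program: 0.040 s
    }
}
```

Four orders of magnitude separate the two transfers, and the gap grows with dataset size; this is the core economic reason Hadoop moves functions to data.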
Example of Hadoop in action
Traditional approach: move data to the program
 Big Data approach: move the program to the data

Example: how many hours does Clint Eastwood appear in all the movies he has made?
– All the movies need to be parsed to find Clint’s face
– Traditional approach: all the movies are uploaded to the application server, through the network
– Big Data approach: the analysis program and a copy of Clint’s picture are downloaded to the
data nodes, through the network
Thank you to: Pascal VEZOLLE/France/IBM@IBMFR and Francois Gibello/France/IBM for the use of this slide
Hadoop principles: Storage, HDFS and MapReduce
 Hadoop Distributed File System = HDFS : where Hadoop stores the data
– HDFS file system spans all the nodes in a cluster with locality awareness
 Hadoop data storage, computation model
– Data stored in a distributed file system, spanning many inexpensive computers
– Send function/program to the data nodes
– i.e. distribute application to compute resources where the data is stored
– Scalable to thousands of nodes and petabytes of data
MapReduce application flow:
1. Map phase: break the job into small parts
2. Shuffle: transfer interim output for final processing
3. Reduce phase: boil all output down to a single result set, returned to the caller
Example MapReduce code (the standard Hadoop WordCount mapper and reducer):

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(Object key, Text val, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(val.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();
  public void reduce(Text key, Iterable<IntWritable> val, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : val) {
      sum += v.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Map tasks are distributed to the cluster; the data is loaded, spread, and resident on the Hadoop data nodes.
Performance = tuning the MapReduce workflow, network, application, servers, and storage.
http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/
http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
http://www.slideshare.net/allenwittenauer/2012-lihadoopperf
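The three phases sketched above can be imitated in a few lines of plain Java to make the shuffle concrete. This is an illustrative sketch only, not Hadoop API code; the class and method names are invented for this example:

```java
import java.util.*;

public class MiniMapReduce {
    // Map phase: emit (word, 1) for every token in one input split
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : split.split("\\s+"))
            out.add(new AbstractMap.SimpleEntry<>(w, 1));
        return out;
    }

    // Shuffle phase: group the interim (key, value) pairs from all splits by key
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapped) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> part : mapped)
            for (Map.Entry<String, Integer> e : part)
                groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        return groups;
    }

    // Reduce phase: boil each key's values down to a single result
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> result = new TreeMap<>();
        groups.forEach((k, vs) -> result.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        // Two "splits", as if the file were stored on two data nodes
        List<List<Map.Entry<String, Integer>>> mapped =
                Arrays.asList(map("big data big"), map("data big"));
        System.out.println(reduce(shuffle(mapped))); // prints: {big=3, data=2}
    }
}
```

On a real cluster each `map` call runs on the node holding its split, and only the interim pairs cross the network during the shuffle, which is why shuffle volume and network bandwidth dominate tuning.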
Big Data Hadoop system architecture
Cluster layout: management nodes (Namenode nodes, JobTracker nodes) plus data nodes with local disks,
connected by a 1-10Gb Ethernet or InfiniBand network.
IO performance = type and # of disks
Reference architecture:
From 12-24 disks, ~1.5GB/s, >35TB, 12-16 CPUs per datanode
Hadoop Distributed File System (HDFS)
• HDFS stores data across multiple data nodes, Namenode knows where data is
• HDFS assumes data nodes and disks will fail, so it achieves reliability by replicating data across
multiple data nodes (typically 3 or more)
• HDFS file system is built from a cluster of data nodes, each of which serves up blocks of data
over the network using a block protocol specific to HDFS
• HDFS Name Node is a single point of failure
Scaling granularity: the data node, scaling both IO and CPU.
Locality awareness: any other location for the data adds network latency.
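The replication factor above translates directly into raw-capacity planning, which is why later slides in this deck reduce the copy count from 3 to 2 on RAID-protected external storage. A small sketch of the arithmetic (the class name and the 100 TB figure are illustrative):

```java
public class HdfsCapacity {
    // Raw disk needed to hold a given usable capacity at a replication factor
    static long rawTB(long usableTB, int replicationFactor) {
        return usableTB * replicationFactor;
    }

    public static void main(String[] args) {
        long atThree = rawTB(100, 3); // HDFS default 3 copies: 300 TB raw for 100 TB of data
        long atTwo   = rawTB(100, 2); // 2 copies on reliable storage: 200 TB raw
        System.out.printf("savings from 3x to 2x: %d TB (%.0f%%)%n",
                atThree - atTwo, 100.0 * (atThree - atTwo) / atThree);
        // prints: savings from 3x to 2x: 100 TB (33%)
    }
}
```

Dropping one replica saves a third of the raw disk, but only makes sense when the underlying storage (RAID arrays, GPFS Storage Server) supplies the reliability HDFS would otherwise get from the third copy.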
Differing Hadoop storage perspectives
Hadoop – It’s Not Just Internal Storage
Understanding the Hadoop rationale for Direct Attached Storage (latency)
Primary Hadoop design goal is affordability at internet scale:
 Data is loaded into Hadoop cluster with data locality
– Spreading data across Data Nodes
– Achieve lowest disk latency through direct attached storage
 Send programs to data (not other way around)
– Data in general does not move within the Hadoop cluster
 Key performance components: disk latency, network interconnect, utilization, bandwidth
 Based on low capital expenditure, low cost commodity components
– Goal: lowest capital cost at scale (adapters, switches and # of ports)
Hadoop Application and performance tuning:
 Fallacy: “all Hadoop jobs are IO-bound”
 Truth: there are many, many Hadoop workflow and tuning variables, and widely varying workloads
– CPU/storage ratio different for different workloads
Network latency is a major performance factor for a Hadoop cluster
– Adding the network latency of an external storage layer forces major retuning of the network
Hadoop application team to Operations: “Until you’ve read the Hadoop book, please don’t waste my time”
http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/
http://blog.cloudera.com/blog/2013/01/apache-hadoop-in-2013-the-state-of-the-platform/
Yet, there are valid operational issues with Hadoop from an Enterprise Shared Storage
management and cost standpoint
 Servers under-utilized?
 Another storage silo?
 Amount of physical storage required per usable GB/TB?
 Reliability as Hadoop application goes into mission critical production?
 Hadoop-specific storage management, migration, backup, recovery?
 Hadoop-specific skill set?
 Ability to understand what data is used where?
 Audit, security, legacy application integration?
 Share Hadoop storage (and servers) dynamically, in a pool with other data center
resources?
Ultimately, it becomes a matter of perspective, type of infrastructure,
and associated priority. Let’s explore this further…
Today: two different types of IT
Source: http://it20.info/2012/02/the-cloud-magic-rectangle-tm/
Transactional IT | Internet-scale workloads
Today’s two major IT workload types
Source: http://it20.info/2012/02/the-cloud-magic-rectangle-tm/ (Transactional IT | Internet-scale workloads)
How to build these two different clouds
Source: http://it20.info/2012/02/the-cloud-magic-rectangle-tm/
Transactional IT | Internet-scale workloads
Hadoop storage choices based on perspective:
– A Hadoop external shared storage infrastructure may often be found on the transactional IT side
– A Hadoop DAS-focused infrastructure may often be found on the internet-scale workloads side
Differing valid perspectives on Hadoop storage issues
Very specific reasons why Direct Attached Storage is used:
 Performance and throughput (lowest latency)
 Low cost commodity components
– cost of JBOD at 4-6 cents/GB today
– Even at 3x copies, still very inexpensive
 Many Hadoop workflow, software components to
tune:
– Map and Reduce workflow
– Memory allocation and usage
– Algorithms, tuning at all levels
– What the tasks are doing
 Hadoop overall cluster configuration
– Server and DAS storage configuration
– 3X copies for performance reasons
– Squeeze out all latency
– Network topology, speeds, utilization
– Compression
– Type of data
 Etc.
Very specific reasons why shared storage is desired:
 Cost CAPEX / OPEX?
– Fixed server/storage ratio?
– Low server % utilization = excess cost?
 Reliability?
 Backup? Disaster Recovery?
 Another silo of storage?
 Managing data:
– Within the Hadoop cluster
– Between Hadoop and other existing storage?
Hadoop Applications,
Business Line team
Operations team
Clearly, different perspectives!
Bottom line on Direct Attached Storage (DAS)
vs. Shared Storage for Hadoop
 Avoid “brute force” one-for-one direct replacement of Hadoop direct attached storage with external shared
storage
– This is too blunt an instrument
• Doesn’t intelligently consider Hadoop design characteristics, performance requirements, overall Hadoop cluster
tuning, workload variations, customer’s environment
 Instead, an intelligent, blended Hadoop storage approach, with full awareness of the Hadoop stack and customer
environment, and multiple perspectives:
– To identify cases where Direct Attached Storage (DAS) makes sense
• Many Hadoop cases where DAS is the correct Hadoop primary storage choice
• For issues of very large scale, performance and throughput, minimize network, adapter costs
– To identify cases where shared storage makes sense
• While maintaining the Hadoop benefits of DAS latency, cost, scale
• Specific intelligent implementations are effective, if designed properly with full Hadoop stack awareness
 Without an intelligent in-depth Hadoop-aware approach:
– Likely may not meet Hadoop performance or cost objectives
• Replacing DAS one-for-one with external shared storage today isn’t cost-effective at true internet scale
• SAN switches / port costs today cannot affordably reach thousands of data nodes
– An intelligent approach is required; otherwise SAN/NAS will introduce a significant increase in disk IO latency
• requiring rebalancing of the entire Hadoop cluster and more expensive networking costs
http://www.snia.org/sites/default/education/tutorials/2012/fall/big_data/DrSamFineberg_Hadoop_Storage_Options-v21.pdf
Intelligently choosing Hadoop storage solutions
Hadoop – It’s Not Just Internal Storage
Intelligently using Hadoop shared storage: goals
 Wish to perform mixed workloads on a shared storage infrastructure
– Some storage for Hadoop, other storage for other things, all on the same storage devices
 Have a desire to trade off reduced number of Hadoop copies by exploiting higher storage reliability
– Saving on total Hadoop physical storage space
 Exploit external storage placement/migration/storage mgmt strategy and capabilities
 Exploit configurable storage recovery policies, backup/restore
 Exploit your existing storage infrastructure in balanced, cost-effective way
 Reduce need for Hadoop storage allocation skills and manual management of Hadoop data
 Exploit existing shared storage infrastructure tooling / performance monitors
 Add audit, security, and legacy-integration opportunities leveraged from the existing infrastructure
– Avoiding a siloed Hadoop storage environment
 Decoupling servers from storage:
– Enable using smaller servers (less power, cooling)
– Enable better use of resources on differing workloads with differing server/storage ratios
– Dynamically allocate servers and storage to work on differing and changing analytics
workloads
Intelligent usage cases for
shared external storage in Hadoop
 Intelligent usage cases where external shared storage supplements and is appropriate for Hadoop:
 Stage 1:
– Intelligently directly attach larger external storage arrays, or external filesystems, for Hadoop primary
storage while still preserving Direct Attach Storage data locality, ability for internet scale
– While using external storage to bring desired function or reduce number of Hadoop copies
– Examples: Nseries Open Solution for Hadoop; GPFS File Placement Optimizer
 Stage 2:
– Augment Hadoop DAS primary storage with a 2nd storage layer (external file system, NAS, or SAN) as a
data protection or archival layer
– Intelligently allocating, importing, exporting data appropriately
 Stage 3:
– Directly replace primary node-based DAS with external shared storage (file system, NAS or SAN)
– Appropriate for certain clusters and certain Hadoop environments where:
• Network rebalancing, adapter/network costs, scale are in line with shared storage benefits
• Example: IBM GPFS Storage Server
Hadoop Stages originally published by John Webster, Evaluator Group, http://www.evaluatorgroup.com/about/principals/
http://searchstorage.techtarget.com/video/Alternatives-to-DAS-in-Hadoop-storage
http://searchstorage.techtarget.com/answer/Can-shared-storage-be-used-with-Hadoop-architecture
http://searchstorage.techtarget.com/video/Understanding-storage-in-the-Hadoop-cluster
IBM Big Data Networked Storage Solution for Hadoop
http://www.redbooks.ibm.com/redpieces/abstracts/redp5010.html
Stage 1 example: IBM DCS3700 with Hadoop replication count = 2; still direct-attached data locality.
IBM Big Data Network Storage Solution for Hadoop
http://www.redbooks.ibm.com/redpieces/abstracts/redp5010.html
Stage 1: Hadoop storage building blocks on IBM Storage, with Hadoop replication count = 2 and improved
Namenode protection.
Another option: Hadoop environment using
IBM GPFS-FPO (File Placement Optimizer)
MapReduce cluster diagram: users submit jobs; MapReduce tasks on each node run against GPFS-FPO storage.
 GPFS File Placement Optimizer instead of HDFS – still places disks local to each server
 Aggregates the local disk space into a single redundant shared GPFS file system
 Designed for MapReduce workloads
 Unlike HDFS, GPFS-FPO is POSIX-compliant, so data maintenance is easy
 Intended as a drop-in replacement for open source HDFS (the IBM BigInsights product
may be required)
Stage 1: IBM General Parallel File System FPO instead of HDFS
GPFS File Placement Optimizer shared storage advantages in a Hadoop environment (Stage 1):

Category                Feature                     GPFS 3.5   HDFS
Performance             Terasort: large reads       yes        yes
Performance             HBase: small writes         yes        yes
Performance             Metadata intensive          yes        yes
Enterprise readiness    POSIX compliance            yes        no
Enterprise readiness    Metadata replication        yes        no
Enterprise readiness    Distributed name node       yes        no
Protection & Recovery   Snapshot                    yes        no
Protection & Recovery   Asynchronous replication    yes        no
Protection & Recovery   Backup                      yes        no
Security & Integrity    Access Control Lists        yes        no
Ease of Use             Policy-based ingest         yes        no
Augment Hadoop Storage with external storage
The Hadoop cluster (management nodes, Namenode nodes, JobTracker nodes, and data nodes with HDFS) is
paired with a compute cluster (management nodes, job submission nodes, batch scheduler nodes, and
compute nodes) backed by external storage.
Possibilities:
• Allocate one of the Hadoop copies externally
• Move data back and forth between Hadoop and external storage
Stage 2
Another option: augment Hadoop with
IBM General Parallel File System in “Stage 2” configuration
The Hadoop cluster (management nodes, Namenode nodes, JobTracker nodes, and data nodes running
GPFS-FPO) is integrated with a POSIX GPFS cluster (management nodes, job submission nodes, batch
scheduler nodes, and compute nodes) backed by GPFS Storage Servers; all nodes can read and write data.
• Integration with existing or new external GPFS cluster
• Policy based file movement in/out of GPFS-File Placement Optimizer pool
• Seamlessly integrate tape as part of the same namespace
Stage 2
Replace Hadoop DAS with intelligent
external Hadoop storage implementation
Compute nodes and the Namenode/JobTracker nodes access a single /gpfs namespace on GPFS Storage
Servers, replacing HDFS:
/gpfs/node1/dsk1, /gpfs/node1/dsk2, …, /gpfs/node1/dskX
/gpfs/node2/dsk1, /gpfs/node2/dsk2, …, /gpfs/node2/dskX
/gpfs/node3/dsk1, /gpfs/node3/dsk2, …, /gpfs/node3/dskX
Stage 3 example: GPFS Storage Server
IBM Big Data Network Storage Solution for Hadoop
http://www.redbooks.ibm.com/redpieces/abstracts/redp5010.html
Stage 3: Hadoop storage building blocks on other IBM Storage (IBM NAS filer via NAS, or SAN), with
Hadoop replication count = 2 and improved Namenode protection.
Future evolution: Hadoop, storage, intersection of the two
 Big Data workloads, Hadoop, and storage are all fast-moving targets
– Already in mid-2013, we’re seeing HDFS 2.0 offering HA, snapshots, and better resiliency
• http://www.slideshare.net/cloudera/hdfs-update-lipcon-federal-big-data-apache-hadoop-forum
– We are seeing a huge adoption rate of Hadoop as inexpensive, deep storage
 More importantly, very soon flash storage costs will start to affect Hadoop reference
architectures
– By 2015, SSD costs will reach a point (15 cents/GB) that will shape future, yet-to-be-determined
Hadoop deployments
– This will start to move the Hadoop bottleneck from storage to the network interconnect
– Whoever best solves that future network interconnect issue will be the next big Hadoop winner
 Today’s intelligent Hadoop usage cases will continue to evolve quickly. Watch this
space!
IBM Hadoop Storage components, tools, offerings
Hadoop – It’s Not Just Internal Storage
Big Data application stack
User Interface Layer
Reports, Dashboards, Mashups, Search,
Ad hoc reporting, Spreadsheets
Analytic Process Layer
Real-time computing and analysis, stream computing,
entity analytics, data mining, data proximity, content
management, text analytics, etc.
Infrastructure layer
Virtualization, central end to end management, control,
deployment on software, server, storage in a
geographically dispersed environment
Users | Security authorization | OS software
Location of competitive advantage: analytics applications
Cloud infrastructure layer: servers, storage
IBM Big Data Software: visualization layer, analytics layer
IBM Big Data Analytics Solutions
Analytics on data in motion: IBM InfoSphere Streams (streaming data; internet-scale data sets)
Analytics on data at rest: IBM InfoSphere BigInsights (non-traditional / non-relational data sources)
Analytics on structured data: traditional data warehouse (traditional / relational data sources)
Big Data infrastructure layer
Same application stack as shown earlier, here highlighting the infrastructure layer: virtualization;
central end-to-end management, control, and deployment of software, servers, and storage in a
geographically dispersed environment.
IBM Direct Attached Storage solutions for Hadoop
Rack-Level Features
Up to 20 System x3630 M4 nodes
Up to 6 System x3550 M4
Management nodes
Up to 960TB storage
Up to 240 Intel Sandy Bridge cores
Up to 3,840GB memory
Up to two 10Gb Ethernet (IBM
G8264-T) switches
Scalable to multi-rack configurations
Available Enterprise and
Performance Features
Redundant storage
Redundant networking
High performance cores
Increased memory
High performance networking
Reference architecture
High volume x86 systems
Integrated solution
PureData System for Hadoop
Each system has local storage
IBM external Big Data storage: GPFS Storage Server scalable building block approach
Building blocks: x3650 M4 servers plus JBOD disk enclosures, with RAID done in GPFS software
Storage solution includes data servers, disk (2TB or 3TB NL-SAS, SSD), software, and
InfiniBand / Ethernet, with no storage controllers
GSS 24 (“Light and Fast”): 2 x3650 servers + 4 JBOD enclosures, 20U rack, 10 GB/sec
GSS 26 (“Workhorse”): 2 x3650 servers + 6 JBOD enclosures, 28U, 12 GB/sec
High-Density Option: 6 x3650 servers + 18 JBOD enclosures, 2 x 42U standard racks, 36 GB/sec
High Volume & Availability: Mainframe & Open Storage for Distributed Systems
Storage management SW: Tivoli Storage Productivity Center, Tivoli Storage FlashCopy Manager,
Tivoli Storage Manager, Tivoli Key Lifecycle Manager
Optimized System Storage: XIV, SONAS, DS8000, N series, Storwize V7000 Unified, DS3500/DCS3700
Integrated innovation: Storage Virtualization SW and SVC, Real-time Compression, deduplication
Integrated Solutions: Virtual Storage Center, Easy Tier, IBM Active Cloud Engine™,
Linear Tape File System (LTFS)
IBM Shared Storage infrastructure solutions: V7000 Unified
Data protection & retention: Tape Library TS3310, Tape Virtualization TS7740, Tape Automation TS3500,
tape drives LTO 3, 4 and 5, ProtecTIER TS7610/20/50
IBM solutions for a Big Data world
IBM Netezza
Storwize
V7000
“Unified”
Storage
“File”
Storage
“Block”
Storage
Disks 3TB, 4 TB
• Storwize V7000
• XIV Gen3
• DS8800
Solid State Drives (SSD)
• Storwize V7000
• XIV Gen3
• DS8800
Scale Out NAS
(SONAS)
IBM Tape
Systems
2.7 ExaBytes
TS3500
InfoSphere Streams
GPFS
Storage
Server
Learning Points
 In many, perhaps most, cases traditional Hadoop Direct Attached Storage is appropriate
 However, many Intelligent usage cases where Hadoop external shared storage, intelligently
implemented, brings significant value
 Stage 1:
– Intelligently directly attach larger external storage arrays, or external filesystems, for Hadoop primary
storage while still preserving Direct Attach Storage data locality, ability for internet scale
– While using external storage to bring desired function or reduce number of Hadoop copies
 Stage 2:
– Augment Hadoop DAS primary storage with a 2nd storage layer (external file system, NAS, or SAN) as a
data protection or archival layer
– Intelligently allocating, importing, exporting data appropriately
 Stage 3:
– Directly replace primary node-based DAS with external shared storage (file system, NAS or SAN)
– Appropriate for certain clusters and certain Hadoop environments where:
• Network rebalancing, adapter/network costs, scale are in line with shared storage benefits
 Most importantly, the Hadoop and storage topic is fast moving and constantly evolving
– Soon, adoption of flash as Hadoop primary storage will significantly change Hadoop dynamics
– This will move the Hadoop bottleneck from storage to the network
Trademarks and disclaimers
© IBM Corporation 2011. All rights reserved.
References in this document to IBM products or services do not imply that IBM intends to make them available in every country.
Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries. IT
Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency which is now part of the Office of Government Commerce. Intel, Intel logo, Intel
Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its
subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, Windows NT, and
the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. ITIL is a registered trademark, and a registered community trademark of the Office
of Government Commerce, and is registered in the U.S. Patent and Trademark Office. UNIX is a registered trademark of The Open Group in the United States and other countries. Java
and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. Cell Broadband Engine is a trademark of Sony Computer Entertainment,
Inc. in the United States, other countries, or both and is used under license therefrom. Linear Tape-Open, LTO, the LTO Logo, Ultrium, and the Ultrium logo are trademarks of HP, IBM
Corp. and Quantum in the U.S. and other countries.
Other product and service names might be trademarks of IBM or other companies. Information is provided "AS IS" without warranty of any kind.
The customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and
performance characteristics may vary by customer.
Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an
endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including vendor announcements and
vendor worldwide homepages. IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-IBM products. Questions
on the capability of non-IBM products should be addressed to the supplier of those products.
All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function or delivery
schedules with respect to any future products. Such commitments are only made in IBM product announcements. The information is presented here to communicate IBM's current
investment and development activities as a good faith effort to help with our customers' future planning.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience
will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed.
Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here.
Prices are suggested U.S. list prices and are subject to change without notice. Starting price may not include a hard drive, operating system or other features. Contact your IBM
representative or Business Partner for the most current pricing in your geography.
Photographs shown may be engineering prototypes. Changes may be incorporated in production models.
Trademarks of International Business Machines Corporation in the United States, other countries, or both can be found on the
World Wide Web at http://www.ibm.com/legal/copytrade.shtml.
ZSP03490-USEN-00
Appendix
Recommend you download and read this very informative IBM book
 “Understanding Big Data”
– Published April 2012
– Free download
– Well worth reading to understand the components of Big Data, and how to exploit them
 Part I: The Big Deal About Big Data
– Chapter 1 – What is Big Data? Hint: You’re a Part of it Every Day
– Chapter 2 – Why Big Data is Important
– Chapter 3 – Why IBM for Big Data
 Part II: Big Data: From the Technology Perspective
– Chapter 4 – All About Hadoop: The Big Data Lingo Chapter
– Chapter 5 – IBM InfoSphere BigInsights – Analytics for “At Rest” Big Data
– Chapter 6 – IBM InfoSphere Streams – Analytics for “In Motion” Big Data
Download your free copy here:
http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
IBM InfoSphere BigInsights = IBM Hadoop distribution
 Core Hadoop
 BigInsights Basic Edition
– Free download with web support
– Limited to ≤ 10 TB of data
– Optional: 24x7 paid support via Fixed Term License
 BigInsights Enterprise Edition – enterprise-grade features:
– Tiered terabyte-based pricing
– Easy installation and programming
– Analytics tooling/visualization
– Administration tooling
– Development tooling
– High availability
– Flexible storage
– Recoverability
– Security
 Professional Services Offerings
– QuickStart, Bootcamp, Education, Custom Development

  • 4. © 2013 IBM Corporation IBM Presentation Template Full Version 55 Understanding today’s Hadoop environments Hadoop – It’s Not Just Internal Storage
  • 5. © 2013 IBM Corporation IBM Storage Solutions for Big Data 6 What is Hadoop? Instead of the traditional IT computation model:  Which brings the data to the function/program on application server  Loads data into memory on an application server and processes it  Unfortunately, this doesn’t scale for internet-scale Big Data problems Apache Hadoop: open source framework for data-intensive applications  Inspired by Google technologies (MapReduce, GFS)  Well-suited to batch-oriented, read-intensive applications  Yahoo! adopted these technologies and open sourced them into the Apache Hadoop project Hadoop has become a pervasive enabler of internet-scale applications, working with thousands of nodes, petabytes of data in highly parallel, cost effective manner  CPU + disks of commodity storage = Hadoop “node”  Hadoop nodes today running mission-critical production in massive clusters  10s of thousands of servers  New nodes can be added as needed to the cluster, without changing:  Data formats, how data is loaded, how jobs are written 6 Tutorials: http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/
  • 6. © 2013 IBM Corporation IBM Storage Solutions for Big Data 7 The World of Hadoop: worldwide usage  eBay  Linkedin  Yahoo!  Facebook  New York Times  Many, many more… http://www.datanami.com/datanami/2012-04-26/six_super-scale_hadoop_deployments.html One source for Hadoop users (but not the only one!): http://wiki.apache.org/hadoop/PoweredBy
  • 7. © 2013 IBM Corporation IBM Storage Solutions for Big Data 8 Hadoop is today a well-developed ecosystem  Hadoop – Overall name of software stack  HDFS – Hadoop Distributed File System  MapReduce – Software compute framework • Map = queries • Reduce=aggregates answers  Hive – Hadoop-based data warehouse  Pig – Hadoop-based language  Hbase – Non-relationship database fast lookups  Flume – Populate Hadoop with data  Oozie – Workflow processing system  Whirr – Libraries to spin up Hadoop on Amazon EC2, Rackspace, etc.  Avro – Data serialization  Mahout – Data mining  Sqoop – Connectivity to non-Hadoop data stores  BigTop – Packaging / interop of all Hadoop components http://wikibon.org/wiki/v/Big_Data:_Hadoop%2C_Business_Analytics_and_Beyond http://blog.cloudera.com/blog/2013/01/apache-hadoop-in-2013-the-state-of-the-platform/ http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/
  • 8. © 2013 IBM Corporation IBM Storage Solutions for Big Data 9 Hadoop vendor ecosystem today http://datameer2.datameer.com/blog/wp-content/uploads/2012/06/hadoop_ecosystem_d3_photoshop.jpg http://www.forbes.com/special-report/2013/industry-atlas.html
  • 9. © 2013 IBM Corporation IBM Storage Solutions for Big Data 10 Why understand the Hadoop stack and environment?  Hadoop is being used for much more than just internet-scale Big Data analytics  Hadoop is increasingly being used by enterprises for inexpensive data storage – As an industry we’re strongly exploiting a much wider variety of data types – With tools like Hadoop, it’s become affordable to ingest, analyze, have available an internet-scale “Big Landing Zone” Hadoop cluster for storing data • Previously not viable to keep online – Hadoop cluster also then can run internet-scale analytics on this data – Significant driver: move to Hadoop to reduce traditional database licensing costs  Storage industry dynamics: – Today, JBOD storage in a server chassis might be as low as 4-6 cents/raw GB • At these prices, adding 50TB usable to Hadoop cluster might only cost $10K in total including server • Even at typical Hadoop 3X copies, this is still less initial cost than enterprise storage at 26 cents/GB • Not saying this includes all factors, but these dynamics clearly affect the decision – And then, there’s flash storage coming….. Must understand full depth of the Hadoop environment and storage industry dynamics: – In order to decide if/when/where Hadoop internal storage or shared storage is appropriate
  • 10. © 2013 IBM Corporation IBM Storage Solutions for Big Data 11 Why Hadoop was created for Big Data Traditional approach : Move data to program Big Data approach: Move function/programs to data Database server Data Query Data return Data process Data Master node Data nodes Data Application server User request Send result User request Send Function to process on Data Query & process Data Data nodes Data Data nodes Data Data nodes Data Send Consolidate result Traditional approach Application server and Database server are separate Data can be on multiple servers Analysis Program can run on multiple Application servers Network is still in the middle Data has to go through network •Big Data Approach  Analysis Program runs where the data is : on Data Node Only Analysis Program has to go through the network Analysis Program need to be MapReduce aware Highly Scalable : 1000s Nodes Petabytes and more Thank you to: Pascal VEZOLLE/France/IBM@IBMFR and Francois Gibello/France/IBM for the use of this slide
  • 11. © 2013 IBM Corporation IBM Storage Solutions for Big Data 12 Example of Hadoop in action Traditional approach : Move data to program Database server Data Query Data return Data process Data Application server User request Send result  Big Data approach : Move program to Data Master node Data nodes Data User request Send Function to process on Data Query & process Data Data nodes Data Data nodes Data Data nodes Data Send Consolidate result Example : How many hours Clint Eastwood appears in all the movies he has done ? All movies need to be parsed to find Clint’s face Traditional approach : All movies are uploaded to application server, through the network • Big Data Approach : The Analysis Program and copy of Clint’s picture are downloaded to data nodes, through the network Thank you to: Pascal VEZOLLE/France/IBM@IBMFR and Francois Gibello/France/IBM for the use of this slide
  • 12. © 2013 IBM Corporation IBM Storage Solutions for Big Data 13 Hadoop principles: Storage, HDFS and MapReduce  Hadoop Distributed File System = HDFS : where Hadoop stores the data – HDFS file system spans all the nodes in a cluster with locality awareness  Hadoop data storage, computation model – Data stored in a distributed file system, spanning many inexpensive computers – Send function/program to the data nodes – i.e. distribute application to compute resources where the data is stored – Scalable to thousands of nodes and petabytes of data MapReduce Application 1. Map Phase (break job into small parts) 2. Shuffle (transfer interim output for final processing) 3. Reduce Phase (boil all output down to a single result set) Return a single result setResult Set Shuffle public static class TokenizerMapper extends Mapper<Object,Text,Text,IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text val, Context StringTokenizer itr = new StringTokenizer(val.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } } public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWrita private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> val, Context context){ int sum = 0; for (IntWritable v : val) { sum += v.get(); . . . 
public static class TokenizerMapper extends Mapper<Object,Text,Text,IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text val, Context StringTokenizer itr = new StringTokenizer(val.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } } public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWrita private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> val, Context context){ int sum = 0; for (IntWritable v : val) { sum += v.get(); . . . Distribute map tasks to cluster Hadoop Data Nodes Data is loaded, spread, resident in Hadoop cluster Performance = tuning Map Reduce workflow, network, application, servers, and storage http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/ http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ http://www.slideshare.net/allenwittenauer/2012-lihadoopperf
  • 13. © 2013 IBM Corporation IBM Storage Solutions for Big Data 14 Big Data hadoop system architecture Data node Data node Data node Management nodes Namenode nodes JobTracker nodes Data nodes with local disks Network 1-10GB Ethernet or Infiniband Network 1-10GB Ethernet or Infiniband Management nodes for Hadoop and cluster Management nodes for Hadoop and cluster IO performance = type and # of disks Reference architecture: From 12-24 disks, ~1.5GB/s, >35TB, 12-16 CPUs per datanode Hadoop Distributed File System (HDFS) • HDFS stores data across multiple data nodes, Namenode knows where data is • HDFS assumes data nodes and disks will fail, so it achieves reliability by replicating data across multiple data nodes (typically 3 or more) • HDFS file system is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS • HDFS Name Node is a single point of failure Scaling granularity: data node, scaling both IO and CPU Locality awareness Note: any other location for data adds network latency
  • 14. © 2013 IBM Corporation IBM Presentation Template Full Version 1515 Differing Hadoop storage perspectives Hadoop – It’s Not Just Internal Storage
  • 15. © 2013 IBM Corporation IBM Storage Solutions for Big Data 16 Understanding Hadoop rationale for Direct Attached Storage (latency) Primary Hadoop design goal is affordability at internet scale:  Data is loaded into Hadoop cluster with data locality – Spreading data across Data Nodes – Achieve lowest disk latency through direct attached storage  Send programs to data (not other way around) – Data in general does not move within the Hadoop cluster  Key performance components: disk latency, network interconnect, utilization, bandwidth  Based on low capital expenditure, low cost commodity components – Goal: lowest capital cost at scale (adapters, switches and # of ports) Hadoop Application and performance tuning:  Fallacy: “all Hadoop jobs are IO-bound”  Truth: there are many many Hadoop workflow and tuningvariables, widely varying workloads – CPU/storage ratio different for different workloads Network latency is major performance impact on Hadoop cluster – Adding external storage layer network latency causes major retuning of network Hadoop Application team to Operations: “Until you’ve read the Hadoop book, please don’t waste my time” http://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/ http://blog.cloudera.com/blog/2013/01/apache-hadoop-in-2013-the-state-of-the-platform/
  • 16. © 2013 IBM Corporation IBM Storage Solutions for Big Data 17 Yet, there are valid operational issues with Hadoop from Enterprise Shared Storage management, cost standpoint  Servers under-utilized?  Another storage silo?  Amount of physical storage required per usable GB/TB?  Reliability as Hadoop application goes into mission critical production?  Hadoop-specific storage management, migration, backup, recovery?  Hadoop-specific skill set?  Ability to understand what data is used where?  Audit, security, legacy application integration?  Share Hadoop storage (and servers) dynamically, in a pool with other data center resources? Ultimately, it becomes a matter of perspective, type of infrastructure, and associated priority. Let’s explore this further………..
  • 17. © 2013 IBM Corporation IBM Storage Solutions for Big Data 18 Today: two different types of IT Source: http://it20.info/2012/02/the-cloud-magic-rectangle-tm/ Internet scale wkloadsTransactional IT
  • 18. © 2013 IBM Corporation IBM Storage Solutions for Big Data 19 Today’s two major IT workload types Source: http://it20.info/2012/02/the-cloud-magic-rectangle-tm/ Transactional IT Internet scale wkloads
  • 19. © 2013 IBM Corporation IBM Storage Solutions for Big Data 20 How to build these two different clouds Source: http://it20.info/2012/02/the-cloud-magic-rectangle-tm/ Transactional IT Internet scale wkloads
  • 20. © 2013 IBM Corporation IBM Storage Solutions for Big Data 21 Hadoop storage choices based on perspective: This is where a Hadoop external shared storage infrastructure may often be found This is where Hadoop DAS-focused infrastructure may often be found
  • 21. © 2013 IBM Corporation IBM Storage Solutions for Big Data 22 Differing valid perspectives on Hadoop storage issues Very specific reasons why Direct Access Storage is used:  Performance and throughput (lowest latency)  Low cost commodity components – cost of JBOD at 4-6 cents/GB today – Even at 3x copies, still very inexpensive  Many Hadoop workflow, software components to tune: – Map and Reduce workflow – Memory allocation and usage – Algorithms, tuning at all levels – What are the tasks doing  Hadoop overall cluster configuration – Server and DAS storage configuration – 3X copies for performance reasons – Squeeze out all latency – Network topology, speeds, utilization – Compression – Type of data  Etc…..…… Very specific reasons why shared storage is desired:  Cost CAPEX / OPEX? – Fixed server/storage ratio? – Low server % utilization = excess cost?  Reliability?  Backup? Disaster Recovery?  Another silo of storage?  Managing data: – Within the Hadoop cluster – Between Hadoop and other existing storage? Hadoop Applications, Business Line team Operations team Clearly, different perspectives!
  • 22. © 2013 IBM Corporation IBM Storage Solutions for Big Data 23 Bottom line on Direct Attached Storage (DAS) vs. Shared Storage for Hadoop  Avoid “brute force” one-for-one direct replacement of Hadoop direct attached storage with external shared storage – This is too blunt an instrument • Doesn’t intelligently consider Hadoop design characteristics, performance requirements, overall Hadoop cluster tuning, workload variations, customer’s environment  Instead, an intelligent, blended Hadoop storage approach, with full awareness of the Hadoop stack and customer environment, and multiple perspectives: – To identify cases where Direct Attached Storage (DAS) makes sense • Many Hadoop cases where DAS is the correct Hadoop primary storage choice • For issues of very large scale, performance and throughput, minimize network, adapter costs – To identify cases where shared storage makes sense • While maintaining the Hadoop benefits of DAS latency, cost, scale • Specific intelligent implementations are effective, if designed properly with full Hadoop stack awareness  Without an intelligent in-depth Hadoop-aware approach: – Likely may not meet Hadoop performance or cost objectives • Replacing DAS one-for-one with external shared storage today isn’t cost-effective at true internet scale • SAN switches / port costs today cannot affordably reach thousands of data nodes – Must use intelligent approach, otherwise SAN/NAS will introduce significant % disk IO latency increase • Requiring rebalancing of entire Hadoop cluster and requiring more expensive networking costs http://www.snia.org/sites/default/education/tutorials/2012/fall/big_data/DrSamFineberg_Hadoop_Storage_Options-v21.pdf
  • 23. © 2013 IBM Corporation IBM Presentation Template Full Version 2424 Intelligently choosing Hadoop storage solutions Hadoop – It’s Not Just Internal Storage
  • 24. © 2013 IBM Corporation IBM Storage Solutions for Big Data 25 Intelligently using Hadoop shared storage: goals  Wish to perform mixed workloads on a shared storage infrastructure – Some storage for Hadoop, other storage for other things, all on the same storage devices  Have a desire to trade off reduced number of Hadoop copies by exploiting higher storage reliability – Saving on total Hadoop physical storage space  Exploit external storage placement/migration/storage mgmt strategy and capabilities  Exploit configurable storage recovery policies, backup/restore  Exploit your existing storage infrastructure in balanced, cost-effective way  Reduce need for Hadoop storage allocation skills and manual management of Hadoop data  Exploit existing shared storage infrastructure tooling / performance monitors  Add audit, security, legacy integration opportunities leveraged out of existing infrastructure – Avoiding silo’d Hadoop storage environment  Decoupling servers from storage: – Enable using smaller servers (less power, cooling) – Enable better use of resources on differing workloads with differing server/storage ratios – Dynamically allocate servers and storage to work on differing and changing analytics workloads
  • 25. © 2013 IBM Corporation IBM Storage Solutions for Big Data 26 Intelligent usage cases for shared external storage in Hadoop  Intelligent usage cases where external shared storage supplements and is appropriate for Hadoop:  Stage 1: – Intelligently directly attach larger external storage arrays, or external filesystems, for Hadoop primary storage while still preserving Direct Attach Storage data locality, ability for internet scale – While using external storage to bring desired function or reduce number of Hadoop copies – Examples: Nseries Open Solution for Hadoop; GPFS File Placement Optimizer  Stage 2: – Augment Hadoop DAS primary storage with 2nd storage layer (external file system, NAS, or SAN) as a data protection or archival layer. – Intelligently allocating, importing, exporting data appropriately  Stage 3: – Directly replace primary node-based DAS with external shared storage (file system, NAS or SAN) – Appropriate for certain clusters and certain Hadoop environments where: • Network rebalancing, adapter/network costs, scale are in line with shared storage benefits • Example: IBM GPFS Storage Server Stage 3 Stage 1 Stage 2 Hadoop Stages originally published by John Webster, Evaluator Group, http://www.evaluatorgroup.com/about/principals/ http://searchstorage.techtarget.com/video/Alternatives-to-DAS-in-Hadoop-storage http://searchstorage.techtarget.com/answer/Can-shared-storage-be-used-with-Hadoop-architecture http://searchstorage.techtarget.com/video/Understanding-storage-in-the-Hadoop-cluster
  • 26. © 2013 IBM Corporation IBM Storage Solutions for Big Data 27 IBM Big Data Networked Storage Solution for Hadoop http://www.redbooks.ibm.com/redpieces/abstracts/redp5010.html Stage 1 Example: IBM DCS3700 with Hadoop replication count = 2 Still direct attached data locality
  • 27. © 2013 IBM Corporation IBM Storage Solutions for Big Data 28 IBM Big Data Network Storage Solution for Hadoop http://www.redbooks.ibm.com/redpieces/abstracts/redp5010.html Stage 1 Hadoop Storage building blocks IBM Storage Hadoop replication count = 2 Hadoop Improved Namenode protection
  • 28. © 2013 IBM Corporation IBM Storage Solutions for Big Data 29 Another option: Hadoop environment using IBM GPFS-FPO (File Placement Optimizer) MapReduce Cluster M a p R e d u c e M a p R e d u c e UsersJobs G P F S - F P O  GPFS File Placement Optimzer instead of HDFS - still places disk local to each server  Aggregates the local disk space into a single redundant shared file GPFS system  Designed for MapReduce workloads  Unlike HDFS, GPFS-FPO is POSIX compliant – so data maintenance is easy  Intended as a drop in replacement for open source HDFS (IBM BigInsights product may be required) Stage 1 IBM General Parallel File System FPO Instead of HDFS
  • 29. © 2013 IBM Corporation IBM Storage Solutions for Big Data 30 GPFS 3.5 HDFS Performance Terasort: large reads   Hbase: small write   Metadata intensive   Enterprise readiness POSIX compliance  Meta-data replication  Distributed name node  Protection & Recovery Snapshot  Asynchronous Replication  Backup  Security & Integrity Access Control Lists  Ease of Use Policy based Ingest  GPFS File Placement Optimizer shared storage advantages in Hadoop environment Stage 1
  • 30. © 2013 IBM Corporation IBM Storage Solutions for Big Data 31 Augment Hadoop Storage with external storage Data node Data node Data node Management nodes Namenode nodes JobTracker nodes Compute node Compute node Compute node Compute node Management nodes Job submission nodes Batch scheduler nodes HDFS External storage Possibilities: •Allocate one of Hadoop copies externally •Move data back and forth between Hadoop and external storage Stage 2
  • 31. © 2013 IBM Corporation IBM Storage Solutions for Big Data 32 Another option: augment Hadoop with IBM General Parallel File System in “Stage 2” configuration Data node Data node Data node Management nodes Namenode nodes JobTracker nodes Compute node Compute node Compute node Compute node Management nodes Job submission nodes Batch scheduler nodes GPFS Storage Server GPFS Storage server GPFS-FPO POSIX GPFS Add GPFS Cluster POSIX world All nodes can write/read data • Integration with existing or new external GPFS cluster • Policy based file movement in/out of GPFS-File Placement Optimizer pool • Seamlessly integrate tape as part of the same namespace Stage 2
  • 32. © 2013 IBM Corporation IBM Storage Solutions for Big Data 33 Replace Hadoop DAS with intelligent external Hadoop storage implementation Compute node1 Compute node3 Compute node2 Namenode nodes JobTracker nodes GPFS Storage Server GPFS Storage server /gpfs/node1/dsk1 /gpfs/node1/dsk2 … /gpfs/node1/dskX /gpfs /gpfs/node2/dsk1 /gpfs/node2/dsk2 … /gpfs/node2/dskX /gpfs/node3/dsk1 /gpfs/node3/dsk2 … /gpfs/node3/dskX HDFS Stage 3 Example: GPFS Storage Server
  • 33. © 2013 IBM Corporation IBM Storage Solutions for Big Data 34 IBM Big Data Network Storage Solution for Hadoop http://www.redbooks.ibm.com/redpieces/abstracts/redp5010.html Stage 3 Hadoop Improved Namenode protection Hadoop Storage building blocks Other IBM Storage Hadoop replication count = 2NAS SAN IBM NAS filer NAS SAN
  • 34. © 2013 IBM Corporation IBM Storage Solutions for Big Data 35 Future evolution: Hadoop, storage, intersection of the two  Continued evolution of Big Data workloads, Hadoop, and storage are all fast moving targets – Already in mid-2013, we’re seeing HDFS 2.0 offering HA, snapshots, better resiliency • http://www.slideshare.net/cloudera/hdfs-update-lipcon-federal-big-data-apache-hadoop-forum – We are seeing a huge adoption rate of Hadoop as inexpensive cheap, deep storage  More importantly, very soon flash storage costs will start to affect Hadoop reference architectures – By 2015, costs on SSD will reach point (15 cents/GB) that future yet-to-be-determined Hadoop deployments – Will start move Hadoop bottleneck from storage to network interconnect – Whoever best solves that future network interconnect issue will be the next big Hadoop winner  Today’s intelligent Hadoop usage cases will continue to evolve quickly. Watch this space!
  • 35. © 2013 IBM Corporation IBM Presentation Template Full Version 3636 IBM Hadoop Storage components, tools, offerings Hadoop – It’s Not Just Internal Storage
  • 36. © 2013 IBM Corporation IBM Storage Solutions for Big Data 37 Big Data application stack User Interface Layer Reports, Dashboards, Mashups, Search, Ad hoc reporting, Spreadsheets Analytic Process Layer Real-time computing and analysis, stream computing, entity analytics, data mining, data proximity, content management, text analytics, etc. Infrastructure layer Virtualization, central end to end management, control, deployment on software, server, storage in a geographically dispersed environment Users Security authorization OS software Location of competitive advantage Analytics applications. Cloud infrastructure layer Servers, storage IBM Big Data Software Visualization layer Analytics layer
  • 37. © 2013 IBM Corporation IBM Storage Solutions for Big Data 38 IBM Big Data Analytics Solutions Streaming Data Traditional Warehouse Analytics on Data at Rest Data Warehouse Analytics on Structured Data Analytics on Data In-Motion IBM InfoSphere BigInsights Traditional / Relational Data Sources Non-Traditional / Non-Relational Data Sources Non-Traditional/ Non-Relational Data Sources Traditional/Relational Data Sources Internet-Scale Data Sets IBM InfoSphere Streams
  • 38. © 2013 IBM Corporation IBM Storage Solutions for Big Data 39 Big Data infrastructure layer User Interface Layer Reports, Dashboards, Mashups, Search, Ad hoc reporting, Spreadsheets Analytic Process Layer Real-time computing and analysis, stream computing, entity analytics, data mining, data proximity, content management, text analytics, etc. Infrastructure layer Virtualization, central end to end management, control, deployment on software, server, storage in a geographically dispersed environment Users Security authorization OS software Cloud infrastructure layer Servers, storage Visualization layer Analytics layer
  • 39. © 2013 IBM Corporation IBM Storage Solutions for Big Data 40 IBM Direct Attached Storage solutions for Hadoop Rack-Level Features Up to 20 System x3630 M4 nodes Up to 6 System x3550 M4 Management nodes Up to 960TB storage Up to 240 Intel Sany Bridge cores Up to 3,840GB memory Up to two 10Gb Ethernet (IBM G8264-T) switches Scalable to multi-rack configurations Available Enterprise and Performance Features Redundant storage Redundant networking High performance cores Increased memory High performance networking Reference architecture High volume x86 systems Integrated solution PureData System for Hadoop Each system has local storage
  • 40. © 2013 IBM Corporation IBM Storage Solutions for Big Data 41 JBOD Disk Enclosure x3650 M4 Server Storage solution includes Data Servers, Disk (2TB or 3TB NL-SAS, SSD), Software, InfiniBand / Ethernet with no Storage Controllers GSS 24: Light and Fast 2 3650 servers + 4 JBOD 20U rack 10 GB/Sec GSS 26: Workhorse 2 3650 servers + 6 JBOD Enclosures, 28U 12 GB/sec High-Density Option 6 3650 servers + 18 JBOD 2 - 42U Standard Racks 36 GB/sec IBM external Big Data storage: GPFS Storage Server scalable building block approach GPFS software RAID
  • 41. © 2013 IBM Corporation IBM Storage Solutions for Big Data 42 High Volume & Availability : Mainframe & Open Storage for Distributed Systems Storage management SW Tivoli Storage Productivity Center Tivoli Storage FlashCopy Manager Tivoli Storage Manager Tivoli Key Lifecycle Manager XIV SONASDS8000 Optimized System Storage N seriesStorwize V7000 Unified Storwize V7000 Integrated Innovation Storage Virtualization SW and SVC Real-time Compression Déduplication DS3500/DCS3700 Integrated Solutions Virtual Storage Center Easy Tier IBM Active Cloud EngineTM Linear Tape File System (LTFS) IBM Shared Storage infrastructure solutions V7000 Unified V7000 Unified V7000 Unified V7000 Unified Tape Library TS3310 Tape Virtualization TS7740 Tape Automation TS3500 Tape drives LTO 3, 4 and 5 ProtecTIER TS7610/20/50 Data protection & retention
  • 42. © 2013 IBM Corporation IBM Storage Solutions for Big Data 43 IBM solutions for a Big Data world IBM Netezza Storwize V7000 “Unified” Storage “File” Storage “Block” Storage Disks 3TB, 4 TB • Storwize V7000 • XIV Gen3 • DS8800 Solid State Drives (SDD) • Storwize V7000 • XIV Gen3 • DS8800 Scale Out NAS (SONAS) IBM Tape Systems 2.7 ExaBytes TS3500 InfoSphere Streams GPFS Storage Server
  • 43. © 2013 IBM Corporation IBM Storage Solutions for Big Data 44 Learning Points  Many, most cases where traditional Hadoop Direct Attached Storage is appropriate  However, many Intelligent usage cases where Hadoop external shared storage, intelligently implemented, brings significant value  Stage 1: – Intelligently directly attach larger external storage arrays, or external filesystems, for Hadoop primary storage while still preserving Direct Attach Storage data locality, ability for internet scale – While using external storage to bring desired function or reduce number of Hadoop copies  Stage 2: – Augment Hadoop DAS primary storage with 2nd storage layer (external file system, NAS, or SAN) as a data protection or archival layer – Intelligently allocating, importing, exporting data appropriately  Stage 3: – Directly replace primary node-based DAS with external shared storage (file system, NAS or SAN) – Appropriate for certain clusters and certain Hadoop environments where: • Network rebalancing, adapter/network costs, scale are in line with shared storage benefits  Most importantly, Hadoop and Storage topic is both fast moving, constantly evolving – Soon, adoption of Hadoop primary flash storage will significantly change Hadoop dynamics – Will move Hadoop bottleneck from storage to network
  • 44. © 2013 IBM Corporation IBM Storage Solutions for Big Data 45
  • 45. © 2013 IBM Corporation IBM Storage Solutions for Big Data 46 Trademarks and disclaimers © IBM Corporation 2011. All rights reserved. References in this document to IBM products or services do not imply that IBM intends to make them available in every country. Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries. IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency which is now part of the Office of Government Commerce. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. ITIL is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office. UNIX is a registered trademark of The Open Group in the United States and other countries. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom. Linear Tape-Open, LTO, the LTO Logo, Ultrium, and the Ultrium logo are trademarks of HP, IBM Corp. and Quantum in the U.S. and other countries. Other product and service names might be trademarks of IBM or other companies. Information is provided "AS IS" without warranty of any kind. 
The customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Information concerning non-IBM products was obtained from a supplier of these products, published announcement material, or other publicly available sources and does not constitute an endorsement of such products by IBM. Sources for non-IBM list prices and performance numbers are taken from publicly available information, including vendor announcements and vendor worldwide homepages. IBM has not tested these products and cannot confirm the accuracy of performance, capability, or any other claims related to non-IBM products. Questions on the capability of non-IBM products should be addressed to the supplier of those products. All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Some information addresses anticipated future capabilities. Such information is not intended as a definitive statement of a commitment to specific levels of performance, function or delivery schedules with respect to any future products. Such commitments are only made in IBM product announcements. The information is presented here to communicate IBM's current investment and development activities as a good faith effort to help with our customers' future planning. Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput or performance improvements equivalent to the ratios stated here. 
Prices are suggested U.S. list prices and are subject to change without notice. Starting price may not include a hard drive, operating system or other features. Contact your IBM representative or Business Partner for the most current pricing in your geography. Photographs shown may be engineering prototypes. Changes may be incorporated in production models. Trademarks of International Business Machines Corporation in the United States, other countries, or both can be found on the World Wide Web at http://www.ibm.com/legal/copytrade.shtml. ZSP03490-USEN-00
  • 46. © 2013 IBM Corporation IBM Storage Solutions for Big Data 47 Appendix
  • 47. © 2013 IBM Corporation IBM Storage Solutions for Big Data 48 Recommend you download and read this very informative IBM book
 “Understanding Big Data”
– Published April 2012
– Free download
– Well worth reading to understand the components of Big Data and how to exploit them
 Part I: The Big Deal about Big Data
– Chapter 1 – What is Big Data? Hint: You’re a Part of It Every Day
– Chapter 2 – Why Big Data is Important
– Chapter 3 – Why IBM for Big Data
 Part II: Big Data: From the Technology Perspective
– Chapter 4 – All About Hadoop: The Big Data Lingo Chapter
– Chapter 5 – IBM InfoSphere BigInsights – Analytics for “At Rest” Big Data
– Chapter 6 – IBM InfoSphere Streams – Analytics for “In Motion” Big Data
Download your free copy here: http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
  • 48. © 2013 IBM Corporation IBM Storage Solutions for Big Data 49 IBM InfoSphere BigInsights = IBM Hadoop distribution
 Core Hadoop
 BigInsights Basic Edition
– Free download with web support
– Limited to <= 10 TB of data
– Optional: 24x7 paid support (Fixed Term License)
 BigInsights Enterprise Edition – enterprise-grade features:
– Tiered terabyte-based pricing
– Easy installation and programming
– Analytics tooling/visualization
– Administration tooling
– Development tooling
– High availability
– Flexible storage
– Recoverability
– Security
 Professional Services Offerings: QuickStart, Bootcamp, Education, Custom Development

Editor's Notes

  1. Traditional applications work on a model where data is loaded into memory, from wherever it is stored, onto the computer where the application runs. As Google was processing ever increasing amounts of internet data, the IT people there quickly realized that this centralized approach to computation was not sustainable. So they decided to scale out their processing and storage, and created a system where the data would be processed on the machine where it is stored. This processing technology became MapReduce, and the storage model is known as the Google File System (GFS), of which today’s HDFS is a direct descendant. Hadoop is a top-level Apache project being built and used by a global community of contributors. Yahoo has been the largest contributor to the project, and it uses Hadoop extensively across its businesses. One of its employees, Doug Cutting, reviewed key papers from Google and concluded that the technologies they described could solve the scalability problems of Nutch, an open source Web search technology. So Cutting led an effort to develop Hadoop (which, incidentally, he named after his son’s stuffed elephant). Hadoop is particularly well-suited to batch-oriented, read-intensive applications. Key features include the ability to distribute and manage data across a large number of nodes and disks. By using the MapReduce programming model with the Hadoop framework, programmers can create applications that automatically take advantage of parallel processing. A single commodity box consisting of, let’s say, a single CPU and disk, forms a node in Hadoop. Such boxes can be combined into clusters, and new nodes can be added to a cluster without an administrator or programmer changing the format of the data, how the data was loaded, or how the jobs (programming logic) were written. 
The following overview of Hadoop was extracted from the Hadoop wiki at http://wiki.apache.org/hadoop/ Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
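The map/reduce division of labor described in these notes can be sketched in a few lines of plain Python — a toy simulation for illustration, not Hadoop's actual Java API. Each input fragment is mapped to (word, 1) pairs as if on its own node, and a reduce step aggregates all pairs into final counts:

```python
from collections import defaultdict

def map_phase(fragment):
    """Map step: emit a (word, 1) pair for each word in one input fragment."""
    for word in fragment.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce step: aggregate the counts per word across all map outputs."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Simulate splitting the input across nodes, mapping each piece, then reducing.
fragments = ["Hadoop stores data", "Hadoop processes data where it is stored"]
pairs = [pair for frag in fragments for pair in map_phase(frag)]
print(reduce_phase(pairs)["hadoop"])  # 2
```

In real Hadoop the map tasks run in parallel on the nodes that already hold the data, and the framework shuffles the intermediate pairs to the reducers; this sketch only shows the data flow.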
  2. http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/ http://www.hadoopwizard.com/which-big-data-company-has-the-worlds-biggest-hadoop-cluster/
  3. http://wikibon.org/wiki/v/Big_Data:_Hadoop%2C_Business_Analytics_and_Beyond A Hadoop “stack” is made up of a number of components. They include: Hadoop Distributed File System (HDFS): The default storage layer in any given Hadoop cluster; Name Node: The node in a Hadoop cluster that provides the client information on where in the cluster particular data is stored and if any nodes fail; Secondary Node: A backup to the Name Node, it periodically replicates and stores data from the Name Node should it fail; Job Tracker: The node in a Hadoop cluster that initiates and coordinates MapReduce jobs, or the processing of the data. Slave Nodes: The grunts of any Hadoop cluster, slave nodes store data and take direction to process it from the Job Tracker. In addition to the above, the Hadoop ecosystem is made up of a number of complementary sub-projects. NoSQL data stores like Cassandra and HBase are also used to store the results of MapReduce jobs in Hadoop. In addition to Java, some MapReduce jobs and other Hadoop functions are written in Pig, an open source language designed specifically for Hadoop. Hive is an open source data warehouse originally developed by Facebook that allows for analytic modeling within Hadoop. Following is a guide to Hadoop's components: Hadoop Distributed File System:  HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data. MapReduce:  MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The “Map” function divides a query into multiple parts and processes data at the node level. The “Reduce” function aggregates the results of the “Map” function to determine the “answer” to the query. Hive:  Hive is a Hadoop-based data warehouse developed by Facebook. It allows users to write queries in SQL, which are then converted to MapReduce. 
This allows SQL programmers with no MapReduce experience to use the warehouse, and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc. Pig:  Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL). HBase:  HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. EBay and Facebook use HBase heavily. Flume:  Flume is a framework for populating Hadoop with data. Agents are populated throughout one's IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop. Oozie:  Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages – such as MapReduce, Pig and Hive – then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed. Whirr:  Whirr is a set of libraries that allows users to easily spin up Hadoop clusters on top of Amazon EC2, Rackspace or any virtual infrastructure. It supports all major virtualized infrastructure vendors on the market. Avro:  Avro is a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls. Mahout:  Mahout is a data mining library. It takes the most popular data mining algorithms for performing clustering, regression testing and statistical modeling and implements them using the MapReduce model. Sqoop:  Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. 
It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target. BigTop:  BigTop is an effort to create a more formal process or framework for packaging and interoperability testing of Hadoop's sub-projects and related components, with the goal of improving the Hadoop platform as a whole.
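Hive's translation of SQL into MapReduce, mentioned above, can be illustrated with a toy example (this is not Hive itself — the table and column names are invented for illustration): a GROUP BY aggregate decomposes into a map that emits the grouping key and a reduce that counts per key.

```python
from collections import defaultdict

# Toy illustration of how a SQL aggregate such as
#   SELECT page, COUNT(*) FROM clicks GROUP BY page
# decomposes into a map step and a reduce step.
clicks = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]

def map_step(row):
    return (row["page"], 1)      # emit the GROUP BY key with a count of 1

def reduce_step(pairs):
    totals = defaultdict(int)
    for key, n in pairs:         # aggregate per key = COUNT(*)
        totals[key] += n
    return dict(totals)

print(reduce_step(map(map_step, clicks)))  # {'/home': 2, '/cart': 1}
```

This is the essence of why SQL programmers with no MapReduce experience can use Hive: the compiler performs this decomposition for them across many such patterns (joins, filters, aggregates).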
  4. Internal storage: 5 cents/GB = $50/TB, so $500 for 10 TB and $2,500 for 50 TB; x 3 Hadoop copies = $7,500 of storage in the Hadoop servers, + $2,500 for the Hadoop server = $10,000. External storage: 26 cents/GB = $260/TB, so $2,600 for 10 TB and $13,000 for 50 TB = $13,000 of storage in the external array, + the cost of the SAN or NAS network + the cost of rebalancing the Hadoop cluster for equivalent performance.
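The note's cost comparison can be recomputed with a short script. The per-gigabyte prices, the 3x replication factor, and the $2,500 server cost are the note's own assumed figures, not market data:

```python
GB_PER_TB = 1000  # decimal terabytes, matching the note's arithmetic

def storage_cost_dollars(cents_per_gb, tb, copies=1):
    """Total storage cost in dollars for `tb` terabytes with `copies` replicas."""
    return cents_per_gb * GB_PER_TB * tb * copies // 100

# Internal (DAS): 5 cents/GB, 50 TB, 3 HDFS copies, plus a $2,500 server.
internal = storage_cost_dollars(5, 50, copies=3) + 2500
# External array: 26 cents/GB, 50 TB, single copy (before SAN/NAS network
# and cluster-rebalancing costs, which the note leaves unquantified).
external = storage_cost_dollars(26, 50)

print(internal, external)  # 10000 13000
```

The unquantified items (SAN/NAS network, rebalancing for equivalent performance) are exactly why the DAS-vs-shared-storage comparison is more than a per-gigabyte price check.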
  5. There are two aspects of Hadoop that are important to understand: MapReduce is a software framework introduced by Google to support distributed computing on large data sets across clusters of computers. The Hadoop Distributed File System (HDFS) is where Hadoop stores its data. This file system spans all the nodes in a cluster. Effectively, HDFS links together the data that resides on many local nodes, making the data part of one big file system. Furthermore, HDFS assumes nodes will fail, so it replicates a given chunk of data across multiple nodes to achieve reliability. The degree of replication can be customized by the Hadoop administrator or programmer; however, the default is to replicate every chunk of data across 3 nodes: 2 on the same rack, and 1 on a different rack. You can use other file systems with Hadoop (e.g., GPFS), but HDFS is quite common. The key to understanding Hadoop lies in the MapReduce programming model. This is essentially a representation of the divide-and-conquer processing model, where your input is split into many small pieces (the map step), and the Hadoop nodes process these pieces in parallel. Once these pieces are processed, the results are distilled (in the reduce step) down to a single answer.
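The default placement described in this note — 2 replicas on the writer's rack, 1 on a different rack — can be sketched as a toy function. This is an illustrative simplification, not HDFS's actual block-placement code, and the node and rack names are invented:

```python
import random

def place_replicas(nodes_by_rack, local_rack):
    """Pick 3 replica nodes per the note's description of the HDFS default:
    two on the writer's own rack, one on some other rack."""
    local = random.sample(nodes_by_rack[local_rack], 2)
    remote_rack = random.choice([r for r in nodes_by_rack if r != local_rack])
    return local + [random.choice(nodes_by_rack[remote_rack])]

racks = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5"]}
replicas = place_replicas(racks, "rack1")
print(len(replicas))  # 3
```

The point of the rack split is failure isolation: losing a whole rack (switch or power failure) still leaves one copy of every block reachable.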
  6. http://www.snia.org/sites/default/education/tutorials/2012/fall/big_data/DrSamFineberg_Hadoop_Storage_Options-v21.pdf
  7. http://www.forbes.com/sites/johnwebster/2013/02/27/hadoop-appliances-the-lengthening-list/
  8. The filer is present just for completeness. It has no part in the story we are telling about GPFS or GPFS-FPO as a replacement for HDFS. There is no need to talk to it, and in your personal copies of the deck, if you wish to remove it, that is fine.
  9. Policy-based ingest – move data into and out of the FPO pool using GPFS policies.
  10. Summary: When you break it down even further, IBM has constructed a portfolio of software and solutions with the breadth and depth to meet all of the needs of all organizations today, combined with unique synergies across this portfolio that enable organizations to start with their most pressing needs knowing that they will be able to leverage their skills and investment in future projects to reduce risk, lower costs and achieve faster time to value in meeting the needs of the business. There are multiple “entry-points”, driven by your most pressing needs, that help you start moving down the path for an information-led transformation. (Note: describe this slide from the bottom-up) When you think about an information-led transformation, you need to ensure that your infrastructure and systems are optimized to handle the various workloads that are demanded of it. Especially today when you are faced with a glut of new information, you need to ensure that relevant information is available, that it is secure and that you are able to retrieve it in a timely manner not only for analytical, operational and transactional systems, but also for regulatory compliance. That is why IBM Software Group and our Systems & Technology Group are working together to provide optimized solutions focused on delivering greater business value to our customers, faster, for increased return on investment. From the new IBM Smart Analytics System, to the new DB2 PureScale for continuous availability, unlimited capacity and application transparency, to the deep integration of System z, IBM has unparalleled expertise in designing and implementing workload optimized systems and services. On top of that infrastructure, there is also the need to ensure that you can bring all of those sources of information together to create a single, trusted view of information from across your business – regardless of whether that information is structured or unstructured – and then manage it over time. 
From data warehousing, Master Data Management, information integration, and Agile ECM and integrated data management, IBM’s InfoSphere portfolio ensures that organizations will be able to leverage their information over time to drive innovation across their business. And armed with this single view of your business, you can then look to optimize business processes and drive greater performance across your organization. Decision makers will have the right information, at the right time, in the right context to make better, more informed decisions, and even anticipate new opportunities or counter potential threats more effectively. The Business Analytics and Optimization Platform supports an information-led transformation in that it focuses on establishing well-constructed processes and empowering individuals throughout the organization with pervasive, predictive real-time analytics. From Cognos and the newly acquired SPSS portfolios, organizations can now be more proactive and predictive in innovating their business.
  11. The IBM Big Data Platform extends the traditional warehouse in two ways: Big Data in Motion, which is streaming data, such as securities data (like stock tickers) or sensor data (like temperature readings, heart rates, or the revolutions per second of a piece of machinery). This data can stream at a very high transfer rate and vary greatly in its structure. Our product offering for Big Data in Motion is InfoSphere Streams. This product is capable of performing analytics on the streaming data in real time. Big Data at Rest, which is a set of data in static storage, for instance a large set of log files from a web site’s click-stream analysis, or pools of raw text from service engagements with customers. Our product offering for Big Data at Rest is InfoSphere BigInsights. This product is capable of performing analytics on this large set of varied data. Both Streams and BigInsights interface with each other, and can use existing data warehouses as data sources for their analytics. Or the data warehouse can pull data from Streams and BigInsights. 
Transcript: Big Data's all about internet scale. In fact, I think these two ideas sort of end up being kind of synonymous. When everybody thinks of Big Data, they think of internet data. They think of all the external data sources that can be pulled together. And we in fact are combining our capability in data warehousing together with this internet scale capability around Hadoop MapReduce.
  12. (Same speaker note as 10: the IBM portfolio summary.)
  13. IBM offers two basic models – the GSS24 and GSS26 – with 4 or 6 JBODs, respectively. These two basic configurations are scalable into larger storage solutions by using them as building blocks.
  14. IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise. Apache Hadoop is the open source software framework used to reliably manage large volumes of structured and unstructured data. BigInsights enhances this technology to withstand the demands of your enterprise, adding administrative, workflow, provisioning, and security features, along with best-in-class analytical capabilities from IBM Research. The result is that you get a more developer and user-friendly solution for complex, large scale analytics. InfoSphere BigInsights allows enterprises of all sizes to cost effectively manage and analyze the massive volume, variety and velocity of data that consumers and businesses create every day. InfoSphere Streams: Part of IBM’s platform for big data, IBM InfoSphere Streams allows you to capture and act on all of your business data... all of the time... just in time. InfoSphere Streams radically extends the state of the art in big data processing; it’s a high-performance computing platform that allows user-developed applications to rapidly ingest, analyze, and correlate information as it arrives from thousands of real-time sources. Users are able to: Continuously analyze massive volumes of data at rates up to petabytes per day. Perform complex analytics of heterogeneous data types including text, images, audio, voice, VoIP, video, police scanners, web traffic, email, GPS data, financial transaction data, satellite data, sensors, and any other type of digital information that is relevant to your business. Leverage sub-millisecond latencies to react to events and trends as they are unfolding, while it is still possible to improve business outcomes. Adapt to rapidly changing data forms and types. Seamlessly deploy applications on any size computer cluster. Meet current reaction time and scalability requirements with the flexibility to evolve with future changes in data volumes and business rules. 
Quickly develop new applications that can be mapped to a variety of hardware configurations, and adapted with shifting priorities. Provide security and information confidentiality for shared information. Learn more about how InfoSphere Streams aligns with any industry.
  15. http://www.forbes.com/sites/johnwebster/2013/02/27/hadoop-appliances-the-lengthening-list/
  16. Thank you!
  17. Link to enter your email address and then get a free copy of this book downloaded: https://www14.software.ibm.com/webapp/iwm/web/signup.do?source=sw-infomgt&S_PKG=500016891&S_CPM=is_bdebook1_biginsightsfp Direct URL to load book (3.5 MB Acrobat Reader file): http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
  18. InfoSphere BigInsights features Apache Hadoop as a core component. There are two releases of InfoSphere BigInsights: Basic and Enterprise. Basic edition – This is a free offering. It has open source components as well as IBM value-add (maintenance console, DB2 integration, integrated installation). You can purchase support for this offering. This is an excellent choice for companies who want a Hadoop environment up and running or are conducting a POC – but it lays the foundation to turn that POC into a pilot or full enterprise deployment. The Enterprise edition adds significant value on the same base platform – you can grow into it. There are two main value-adds: this offering hardens Hadoop, providing enterprise-quality stability, and it provides an analytics layer. Specifically, it includes a rock-solid file system alternative to what’s included in open source Hadoop, text analytics, analytics visualization, security, an integrated web console, workflow and scheduling, indexing, and documentation. 
Transcript: What we're doing is putting together a comprehensive solution around Big Data and if you're going to the sessions here you'll hear more about this in some of the breakouts, certainly in the expo demonstration capability. But we're looking at and starting to deliver scenarios that combine both non-traditional Big Data types of information together with traditional data. And putting those together in creative ways to allow you to navigate and mine and understand the patterns across this very large corpus of information. And of course, as required, we're applying real-time stream processing into that flow because we see many of these scenarios demanding real-time analytics.