1. Virtualize Big Data to Make the Elephant Dance
June Yang, Senior Director of Product Management, VMware
Dan Baskette, Senior Consultant Technologist, Pivotal
© Copyright 2013 EMC Corporation. All rights reserved.
2. Unstructured Data is exploding… Hadoop is driving growth
Hadoop adoption is ramping
(Pie chart: Evaluating 53%, In production 23%, Piloting 18%, Testing 2%, Other 2%, Don't know 2%)
Unstructured data driving growth
(Line chart: Structured vs. Unstructured data, 2011 to 2020)
Complex unstructured data forecasted to outpace structured relational data by 10x by 2020
Source: Forrester Survey of 60 CIOs, September 2011
• Unstructured data explosion and Hadoop capabilities causing CIOs to reconsider Enterprise data strategy
• Gartner predicts +800% data growth over next 5 years
• Hadoop’s ability to process raw data at cost presents intriguing value prop for CIOs
3. Broad Application of Hadoop Technology
Use Cases
• Log Processing / Click Stream Analytics
• Machine Learning / sophisticated data mining
• Web crawling / text processing
• Extract Transform Load (ETL) replacement
• Image / XML message processing
• General archiving / compliance
Vertical Industries
• Financial Services
• Internet Retailer
• Pharmaceutical / Drug Discovery
• Mobile / Telecom
• Scientific Research
• Social Media
Hadoop is a platform that will revolutionize how Enterprises handle data
4. The Big Data Journey in the Enterprise
Stage 3: Cloud Analytics Platform
• Serve many departments
• Often part of mission critical workflow
• Fully integrated with analytics/BI tools
Stage 2: Hadoop Production
• Serve a few departments
• More use cases
• Growing # and size of clusters
• Core Hadoop + components
Stage 1: Hadoop Piloting
• Often start with line of business
• Try 1 or 2 use cases to explore the value of Hadoop
(Chart axes: Integrated on the vertical axis; Scale on the horizontal axis, from 0 node to 10’s to 100’s)
6. One click to scale out your cluster on the fly
7. Customize your Hadoop/HBase Cluster
Customize with Cluster Specification File
8. Cluster Spec File Details
Callouts on the slide:
• Storage configuration
• Choice of shared storage or local disk
• High availability option
• # of Hadoop nodes
• Resource configuration
Cluster Specification File
"groups":[
  { "name":"master",
    "roles":[
      "hadoop_namenode",
      "hadoop_jobtracker"],
    "storage": {
      "type": "SHARED", "sizeGB": 20},
    "instance_type":"MEDIUM",
    "instance_num":1,
    "ha":true},
  { "name":"worker",
    "roles":[
      "hadoop_datanode",
      "hadoop_tasktracker"
    ],
    "instance_type":"SMALL",
    "instance_num":5,
    "ha":false
…
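To make the callouts concrete, here is a minimal Python sketch that parses a completed version of the fragment above and reports the node count and HA flag per group. The slide elides the rest of the file, so the closing brackets, the quoted `instance_type` values, and the helper names `summarize_spec` and `total_nodes` are assumptions for illustration only; this is not the Serengeti tooling itself.

```python
import json

# Hypothetical, completed version of the slide's fragment. The slide elides
# the remainder of the file, so the closing brackets are an assumption.
SPEC_TEXT = """
{
  "groups": [
    {"name": "master",
     "roles": ["hadoop_namenode", "hadoop_jobtracker"],
     "storage": {"type": "SHARED", "sizeGB": 20},
     "instance_type": "MEDIUM",
     "instance_num": 1,
     "ha": true},
    {"name": "worker",
     "roles": ["hadoop_datanode", "hadoop_tasktracker"],
     "instance_type": "SMALL",
     "instance_num": 5,
     "ha": false}
  ]
}
"""


def summarize_spec(spec_text):
    """Parse a spec and return {group name: (node count, HA flag)}."""
    spec = json.loads(spec_text)
    summary = {}
    for group in spec["groups"]:
        # Keys treated as required in this sketch; a real tool may apply defaults.
        for key in ("name", "roles", "instance_num"):
            if key not in group:
                raise ValueError(f"group is missing required key: {key}")
        summary[group["name"]] = (group["instance_num"], group.get("ha", False))
    return summary


def total_nodes(spec_text):
    """Total number of VMs the spec would request across all groups."""
    return sum(count for count, _ha in summarize_spec(spec_text).values())


if __name__ == "__main__":
    print(summarize_spec(SPEC_TEXT))
    print(total_nodes(SPEC_TEXT))
```

For the example spec, the master group resolves to one HA-enabled node and the worker group to five non-HA nodes, six VMs in total.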
9. Your Choice of Hadoop Distributions and Tools
Distributions
Community Projects
• Flexibility to choose and try out major distributions
• Support for multiple projects
• Open architecture to welcome industry participation
• Contributing Hadoop Virtualization Extensions (HVE) to open source community
10. Proactive monitoring with VCOPs
Proactive monitoring through vCenter Operations (VCOPs):
• Gain comprehensive visibility
• Eliminate manual processes with intelligent automation
• Proactively manage operations
Alternatively, use monitoring tools like Nagios or Ganglia
11. Beyond day 1 - Automation of Hadoop Cluster lifecycle management
(Lifecycle diagram stages: Deploy, Customize, Scaling, Tune configuration, Load data, Execute jobs, …)
12. The Big Data Journey in the Enterprise
Stage 2: Hadoop Production
• Serve a few departments
• More use cases
• Growing # and size of clusters
• Core Hadoop + components
Stage 1: Hadoop Piloting
• Rapid deployment
• On the fly cluster resizing
• Choice of Hadoop distros
• Automation of cluster lifecycle
(Chart axes: Integrated on the vertical axis; Scale on the horizontal axis, from 0 node to 10’s to 100’s)
13. Achieve HA for the Entire Hadoop Stack
(Stack diagram components:)
• ZooKeeper (Coordination)
• Pig (Data Flow)
• Hive (SQL)
• HCatalog
• MapReduce (Job Scheduling/Execution System)
• HBase (Key-Value store)
• HDFS (Hadoop Distributed File System)
• Jobtracker
• Namenode
• Hive MetaDB
• HCatalog MDB
• Management Server
• BI Reporting
• RDBMS
• ETL Tools
• Server
Key points:
• vSphere HA is battle-tested high availability technology
• Single mechanism to achieve HA for the entire Hadoop stack
• One click to enable HA and/or FT
14. Challenges of Running Hadoop in Enterprises
Dept A: recommendation engine (Production, Test, and Experimentation clusters)
Dept B: ad targeting (Production, Test, and Experimentation clusters)
Data sources: Log files, Transaction data, Social data, Historical customer behavior
On the horizon…
• NoSQL
• Real time SQL
• …
Pain Points:
1. Cluster sprawl
2. Redundant common data in separate clusters
3. Difficult to use the right tool for the right problem
4. Peak compute and I/O resource is limited to the number of nodes in each independent cluster
15. What if you can…
(Diagram, before: separate Production, Test, and Experimentation clusters for the recommendation engine and for ad targeting)
One physical platform to support multiple virtual big data clusters
(Diagram, after: Experimentation, Production recommendation engine, Test/Dev, and Production Ad Targeting clusters on one platform)
16. Bigger is Better
Hadoop is linearly scalable: more nodes, better performance. The same job will take:
– 2 hours to complete on a 50 node cluster
– 1 hour to complete on a 100 node cluster
– 30 min to complete on a 200 node cluster
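The figures above follow an idealized model in which runtime is inversely proportional to node count. A minimal Python sketch of that arithmetic is below; the function name `ideal_runtime_minutes` is hypothetical, and real jobs will deviate from this model because of startup overhead, data skew, and I/O limits.

```python
def ideal_runtime_minutes(base_minutes, base_nodes, nodes):
    """Runtime under ideal linear scaling: time is inversely proportional to node count.

    base_minutes: measured runtime on a cluster of base_nodes nodes.
    nodes: size of the cluster we want an estimate for.
    """
    if base_nodes <= 0 or nodes <= 0:
        raise ValueError("node counts must be positive")
    return base_minutes * base_nodes / nodes


if __name__ == "__main__":
    # Baseline from the slide: 2 hours (120 minutes) on a 50 node cluster.
    for n in (50, 100, 200):
        print(n, "nodes:", ideal_runtime_minutes(120, 50, n), "minutes")
```

With the slide's baseline of 120 minutes on 50 nodes, the model reproduces the 60 minute and 30 minute figures for 100 and 200 nodes.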
17. You may ask
What about differentiated SLAs?
– Production Hadoop jobs need to be guaranteed high priority
– Experimental Hadoop jobs should run at lower priority
Will I have a noisy neighbor problem with a shared infrastructure approach?
18. VM Containers with Isolation are a Tried and Tested Approach
(Diagram: Hungry Workload 1, Reckless Workload 2, and Noisy Workload 3 running in isolated VMs on VMware vSphere + Serengeti across multiple hosts)
19. Shared infrastructure: Three big types of Isolation are Required
Resource Isolation
• Control the greedy noisy neighbor
• Reserve resources to meet needs
Version Isolation
• Allow concurrent OS, App, Distro versions
Security Isolation
• Provide privacy between users/groups
• Runtime and data privacy required
(Diagram: VMware vSphere + Serengeti across multiple hosts)
20. With virtualization, you can have your cake and eat it too
One physical platform to support multiple virtual big data clusters
(Diagram: compute layer with Experimentation, Production recommendation engine, Test/Dev, and Production Ad Targeting clusters, labeled Low Priority and High Priority, above the data layer on VMware vSphere + Serengeti)
• Share data to minimize copying
• Single infrastructure to maintain
• Bigger cluster for better performance
• Share hardware resource to achieve higher utilization
• Virtualization ensures strong isolation between clusters:
– Resource isolation
– Failure isolation
– Configuration isolation
– Security isolation
21. Elastic Hadoop with Virtualization
Hadoop node with combined storage/compute:
• Unmodified Hadoop node in a VM
• VM lifecycle determined by Datanode
• Limited elasticity
Separate compute from storage:
• Separate compute from data
• Stateless compute
• Elastic compute
Separate virtual compute clusters per tenant (diagram labels tenants T1 and T2):
• Separate virtual compute
• Compute cluster per tenant
• Stronger VM-grade security and resource isolation
22. Scale in/out Hadoop dynamically
Deploy separate compute clusters for different tenants sharing HDFS.
Commission/decommission task trackers according to priority and
available resources
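A minimal Python sketch of the idea described above: given a fixed pool of compute VMs, grant task trackers to tenants in priority order, then derive how many to commission or decommission per tenant. The function names, the tuple layout, and the strict-priority policy are assumptions for illustration; the slides do not describe Serengeti's actual scheduling logic.

```python
def allocate_task_trackers(total_slots, tenants):
    """Assign compute VMs (task trackers) to tenants in priority order.

    tenants: list of (name, priority, demand) tuples; a lower priority
    number means a more important tenant (an assumption of this sketch).
    Returns {name: number of compute VMs granted}.
    """
    allocation = {}
    remaining = total_slots
    for name, _priority, demand in sorted(tenants, key=lambda t: t[1]):
        granted = min(demand, remaining)
        allocation[name] = granted
        remaining -= granted
    return allocation


def plan_changes(current, target):
    """Per tenant: positive = task trackers to commission, negative = to decommission."""
    names = set(current) | set(target)
    return {name: target.get(name, 0) - current.get(name, 0) for name in names}


if __name__ == "__main__":
    tenants = [("experimentation", 2, 6), ("production", 1, 6)]
    target = allocate_task_trackers(8, tenants)
    current = {"production": 4, "experimentation": 4}
    print(target)
    print(plan_changes(current, target))
```

With 8 compute VMs and both tenants asking for 6, production receives its full 6 and experimentation the remaining 2, so the plan commissions 2 task trackers for production and decommissions 2 from experimentation.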
(Diagram: compute layer with two Job Trackers, each managing Compute VMs in a dynamic resource pool, for the Experimentation and Production recommendation engine clusters; data layer below; all on VMware vSphere + Serengeti)
23. The Big Data Journey in the Enterprise
Stage 3: Cloud Analytics Platform
• Serve many departments
• Often part of mission critical workflow
• Fully integrated with analytics/BI tools
Stage 2: Hadoop Production
• High Availability
• Consolidation
• Differentiated SLAs
• Elastic Scaling
Stage 1: Hadoop Piloting
• Rapid deployment
• On the fly cluster resizing
• Choice of Hadoop distros
• Automation of cluster lifecycle
(Chart axes: Integrated on the vertical axis; Scale on the horizontal axis, from 0 node to 10’s to 100’s)
24. Cloud Analytics Platform
(Architecture diagram components:)
• Business Intelligence, Machine Learning, Real Time Streams
• CETAS, Automated Models, Stream Processing, ETL, Data Visualization, …
• Real Time Structured Database, Data Warehouse, Unstructured and Batch Processing, HDFS
• Cloud Infrastructure: Compute, Storage, Networking
25. Big Data Tools and Characteristics
Framework | Scale of data | Scale of Cluster | Computable Data? | Local Disks?
Map-reduce: Hadoop | 100s PB | 10s to 1,000s | Yes | Yes, for cost, bandwidth and availability
Big-SQL: HAWQ, Aster Data, Impala, … | PB’s | 10s to 100s | Some | Yes, for cost and bandwidth
No-SQL: Cassandra, HBase, … | Trillions of rows | 10s to 100s | Some | Yes, for cost and availability
In-Memory: Redis, Gemfire, Membase, … | Billions of rows | 10s to 100s | Yes | Primarily Memory
26. Choose a platform that…
• Allows users to pick the right tools at the right time
• Puts resources where needed based on SLA policy
27. In-house Hadoop as a Service – (Hadoop + Hadoop)
(Diagram: compute layer with Production ETL of log files, Ad hoc data mining, and Production recommendation engine clusters; data layer with two HDFS instances; all on VMware vSphere + Serengeti across multiple hosts)
28. Integrated Big Data Production – (Mixed big data workloads)
(Diagram: compute layer with Hadoop batch analysis and HBase real-time queries; data layer with HDFS; alongside NoSQL (Cassandra key-value store) and MPP DBMS (analysis of structured data); all on VMware vSphere + Serengeti across multiple hosts)
29. Integrated Hadoop and Webapps – (Big Data + Other Workloads)
(Diagram: compute layer with a short-lived Hadoop compute cluster, a Hadoop compute cluster, and web servers for an ecommerce site; data layer with HDFS; all on VMware vSphere + Serengeti across multiple hosts)
30. The Big Data Journey in the Enterprise
Stage 3: Cloud Analytics Platform
• Mixed workloads
• Right tool at the right time
• Flexible and elastic infrastructure
Stage 2: Hadoop Production
• High Availability
• Consolidation
• Differentiated SLAs
• Elastic Scaling
Stage 1: Hadoop Piloting
• Rapid deployment
• On the fly cluster resizing
• Choice of Hadoop distros
• Automation of cluster lifecycle
(Chart axes: Integrated on the vertical axis; Scale on the horizontal axis, from 0 node to 10’s to 100’s)
31. Learn More
• Download and try Serengeti: projectserengeti.org
• VMware Hadoop site: vmware.com/hadoop
• Hadoop performance on vSphere white paper: http://www.vmware.com/files/pdf/techpaper/hadoop-vsphere51-32hosts.pdf
• Hadoop virtualization extensions (HVE) white paper: http://www.vmware.com/files/pdf/techpaper/hadoop-vsphere51-32hosts.pdf
32. Thank You!
June Yang
Senior Director, VMware
juneyang@vmware.com
Dan Baskette
Senior Consultant Technologist
dan.baskette@emc.com
33. Pivotal Sessions at EMC World
Session | Presenter | Dates/Times
The Pivotal Platform: A Purpose-Built Platform for Big-Data-Driven Applications | Josh Klahr | Tue 5:30 - 6:30, Palazzo E; Wed 11:30 - 12:30, Delfino 4005
Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action | Noelle Sio | Tue 10:00 - 11:00, Lando 4205; Thu 8:30 - 9:30, Palazzo F
Pivotal: Operationalizing 1000-node Hadoop Cluster – Analytics Workbench | Clinton Ooi, Bhavin Modi | Tue 11:30 - 12:30, Palazzo L; Thu 10:00 - 11:00 am, Delfino 4001A
Pivotal: for Powerful Processing of Unstructured Data For Valuable Insights | SK Krishnamurthy | Mon 4:00 - 5:00, Lando 4201 A; Tue 4:00 - 5:00, Palazzo M
Pivotal: Big & Fast data – merging real-time data and deep analytics | Michael Crutcher | Mon 1:00 - 2:00, Lando 4201 A; Wed 10:00 - 11:00, Palazzo M
Pivotal: Virtualize Big Data to Make The Elephant Dance | June Yang, Dan Baskette | Mon 11:30 - 12:30, Marcello 4401A; Wed 4:00 - 5:00, Palazzo E
Hadoop Design Patterns | Don Miner | Mon 2:30 - 3:30, Palazzo F; Wed 8:30 - 9:30, Delfino 4005