2. Infrastructure, Apps and now Data…
[Diagram: three steps: Simplify Infrastructure (with cloud), Simplify App Platform (through PaaS), Simplify Data. Other labels: Build, Run, Manage, Private, Public]
3. Trend 1/3: New Data Growing at 60% Y/Y
Exabytes of information stored: 20 zettabytes by 2015, 1 yottabyte by 2030
Yes, you are part of the yotta generation…
Data sources: audio, digital TV, digital photos, camera phones, RFID, medical imaging, sensors, satellite images, logs, scanners, Twitter, CAD/CAM, appliances, machine data, digital movies
Source: The Information Explosion, 2009
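The growth figures on this slide can be related with a little compound-growth arithmetic. The sketch below assumes simple annual compounding and takes 1 yottabyte as 1000 zettabytes; the function names are illustrative and not from the deck.

```python
import math

ZB_2015 = 20.0      # zettabytes stored by 2015 (from the slide)
ZB_2030 = 1000.0    # 1 yottabyte = 1000 zettabytes, by 2030 (from the slide)
GROWTH = 0.60       # 60% year-over-year growth (from the slide title)

def years_to_reach(start, target, rate):
    """Years of compound growth at `rate` needed to go from start to target."""
    return math.log(target / start) / math.log(1.0 + rate)

def implied_rate(start, target, years):
    """Constant annual growth rate that turns start into target in `years`."""
    return (target / start) ** (1.0 / years) - 1.0

# How long 60% Y/Y growth would take to go from 20 ZB to 1 YB
years_at_60 = years_to_reach(ZB_2015, ZB_2030, GROWTH)
# The constant rate implied by the two endpoints (2015 and 2030)
rate_2015_2030 = implied_rate(ZB_2015, ZB_2030, 2030 - 2015)

print(f"{years_at_60:.1f} years at 60% Y/Y; implied 2015-2030 rate {rate_2015_2030:.1%}")
```

The two numbers answer different questions: one projects the headline rate forward, the other back-solves a rate from the two endpoints.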
7. Early Adopters: Enterprise Segmentation
Verticals
• Financial Services
• Retail
• Telco
• Manufacturing
• Government
Targets
• Existing Hadoop Users
• Business Analysts
• Data Scientists
• LOB managers
• IT/Ops
Use Cases
• Business Trend Analytics
• Revenue analytics
• CDR, call pattern analytics
• Sensor data analytics
• Log, machine data analytics
• Fraud detection
• Homeland security
• Predictive analytics
8. Early Adopters: Non-enterprise Segmentation
Verticals
• Online Advertising
• eCommerce
• Mobile
• Social Media
• Gaming
Targets
• End users/Exec users
• Business Analysts
• PM, LOB managers
• Marketing/Sales
• Data Engineers
• Data Scientists
• IT/Operations
Use Cases
• Behavioral Analytics
• Audience segmentation
• Revenue Optimization
• User activity monetization
• Inventory, price management
• Recommendations
• Predictive analytics
9. Why now? More transactions (Social/Mobile/Local)
SoMoLo companies:
• 500 TB data/day
• 30B messages/month
• 35 check-ins/sec
• 13k API calls/sec
Big “traditional” companies:
• 1 TB data/day
• 3.7B calls/month
• 10k card transactions/sec
[Figure axes: size of data, communications, transactions]
10. Trend 3/3: Value from Data Exceeds Hardware Cost
! Value from the intelligence of data analytics now outstrips the cost of hardware
• Hadoop enables the use of 10x lower cost hardware
• Hardware cost halving every 18 months
[Figure: value vs. cost. Big Iron: $40k/CPU; Commodity Cluster: $1k/CPU]
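The "halving every 18 months" claim is an exponential decay, which can be sketched as below. This is a minimal illustration assuming a constant halving period; the function names and the example prices plugged in are for illustration only.

```python
import math

def cost_after(months, initial_cost, halving_period=18.0):
    """Cost after `months`, assuming it halves every `halving_period` months."""
    return initial_cost * 0.5 ** (months / halving_period)

def months_until(target_cost, initial_cost, halving_period=18.0):
    """Months of steady halving needed for initial_cost to fall to target_cost."""
    return halving_period * math.log2(initial_cost / target_cost)

# A $1k commodity CPU after three years (two halving periods)
commodity_in_3y = cost_after(36, 1000.0)

# Hypothetical: how long a $40k/CPU price would need to halve down to $1k/CPU
big_iron_catch_up = months_until(1000.0, 40000.0)

print(commodity_in_3y, round(big_iron_catch_up, 1))
```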
11. The Old Big Data Stack
[Diagram components:
• Sources: files, SQL databases
• Extract, Transform, Load (Informatica)
• Column Oriented Relational Database (Oracle, Teradata, DB2)
• Master Data Management (Oracle, SAP)
• Business Intelligence: Statistics (SAS, SPSS), Data Visualization (Crystal, Business Objects)]
12. The Old Big Data Stack
! Unable to handle large data volumes & diversity of data
! Iterative, brute-force and slow process
! Lack of ad-hoc data navigation across events and time
! Cumbersome ETL to “process” and DBAs to “prepare”
! Focused on structured data that is warehoused
! Web analytics solutions force real-time events into rigid schemas in DBs
[Diagram: the old big data stack, repeated from slide 11]
13. The Journey To Big Data Analytics
1. BI As A Service (Technology Focus)
• All Data, Faster Answers, Elastic & Scalable
• Goal: encourage experimentation with existing data
2. Agile Analytics (People & Productivity Focus)
• Data Science, Collaboration, Self-Service
• Goal: discover meaningful insights that impact the business
3. Predictive Enterprise (Application Focus)
• Real Time Decisions, New Applications, Data Monetization
• Goal: operationalize those insights as quickly as possible
[Diagram layers: Cloud Infrastructure, Analytic Engines, Analytic Productivity Platform, Agile Process & Tools, Big Data Enabled Apps]
14. Customer profiles
1. Business analysts, LOB managers, execs
• Need: out-of-the-box analytics
• Designed for: self-service for end-user leveraging app
developers
2. Data engineers/analysts
• Need: out-of-the-box + some customization
• Designed for: admin + operations
3. Data scientists
• Need: power capabilities + heavy customization
• Designed for: data scientists
4. IT, Operations
• Need: out-of-the-box + some customization
• Designed for: IT/admin, ops
15. What is Data Science and Data Engineering?
[Diagram: Data Science & Data Engineering at the intersection of four skill areas]
• Distributed, parallelization & programming skills
• Math and statistical algorithm knowledge
• Business domain and problem understanding
• Vertical or horizontal use case and analytics experience
16. What is Driving Big Data?
[Chart: data by type: structured, semi-structured, largely unstructured]
Source: IBM and Oxford Survey: Getting Closer to Customers Tops Big Data Agenda, October 17, 2012
17. Today’s Big Data System
[Diagram components:
• Real-time streams
• Real-time processing (S4, Storm)
• ETL
• Real-time structured database
• Big SQL
• Parallel batch processing
• Unstructured data (HDFS)
• Analytics]
18. The Unified Analytics Cloud Platform
[Diagram layers, top to bottom:
• Analytics Tools: MADlib, Karmasphere, Datameer, Tableau
• Developer Frameworks: Hadoop, R, Python
• PaaS: Spring, Cloud Foundry
• Database/DataStore: Cassandra, HBase, HDFS, HAWQ, Impala
• Data Platform / Data PaaS: Data Director, EMC Chorus
• Cloud Infrastructure (private, public): vSphere]
19. The New Big Data System
[Diagram components:
• Real-time streams
• Real-time stream processing
• Automated models
• Business intelligence: data visualization (Excel, Tableau)
• ETL
• Common query
• Real-time structured database
• Structured data engine
• Unstructured and batch processing (Hadoop, Hive)
• Federated query (SQL aggregation)
• Structured and unstructured data (HDFS, S3)
• Cloud infrastructure: compute, storage, networking]
22. New Technologies
[Diagram: the new big data system annotated with example technologies:
• Real-time streams: Twitter, sensor data, mobile events, machine logs
• Real-time stream processing: S4, Storm, …
• Machine learning and automated models: CETAS
• Business intelligence: data visualization (Excel, Tableau)
• ETL and common query
• Real-time structured database: Gemfire, hBase?, etc.
• Other engines shown: SPARK, SHARK, Aster, Greenplum
• Unstructured and batch processing: Map-Reduce (Hadoop, Hive)
• Query virtualization (SQL aggregation), …
• Storage: HDFS, Ceph, MAPR, Colossus
• Cloud infrastructure: compute, storage, networking]
23. Agenda
! Frameworks
• Batch processing: Hadoop, Spark
• Graph processing: Pregel, Apache Giraph
• Real-time processing: Storm, S4, D-Streams
• Interactive processing: Hive, Impala, Shark
! New requirements
• Better network architectures, abstractions and end-to-end resource
management
• Whither disk-locality and the flexibility to move data to compute
instead
• Cluster/Datacenter-wide storage abstractions and services
• The silo-less datacenter (multiple frameworks sharing a single
physical cluster and sharing sticky data)
24. Big Data Processing Patterns (batch, real-time or interactive)
! Hadoop, Hive, Impala, Storm, S4, D-Streams, Shark
• Funnel: large input, small output, e.g., link/ad click-statistics
• Reverse funnel: small input, large output, e.g., logfile loading
• Data transform: input and output sizes similar, e.g., data conversion/translation
! Spark
• Iterative, e.g., machine learning tasks
! Pregel, Giraph
• Graph-based analyses to reason about relationships, e.g., PageRank, Ravi's social approach to VI management
25. Batch processing frameworks (1/2)
! Apache Hadoop MapReduce (Yahoo!)
• Parallel data-processing paradigm (made popular by Google). Uses a
distributed file system (HDFS) for persistence. Uses commodity h/w
• Model of operation: Mapper (read from HDFS + compute in parallel) ->
Reducer (process map outputs in parallel) -> write to HDFS
• Key components: Namenode, Datanode, TaskTracker, JobTracker
• Apache Zookeeper sometimes used for coordination
• Weakness: Not well-suited for iterative (or graph) computations
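The mapper -> reducer model of operation can be illustrated with a single-process word count. This is a minimal sketch in plain Python, not the Hadoop API: the function names are illustrative, an in-memory list stands in for HDFS, and the shuffle step (grouping map outputs by key) is made explicit.

```python
from collections import defaultdict

def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group map outputs by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce step: combine all values emitted for one key."""
    return key, sum(values)

def run_job(lines):
    """Run map, shuffle and reduce in sequence over an in-memory input."""
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())

counts = run_job(["big data big cluster", "data on commodity hardware"])
print(counts)
```

In a real cluster the mappers and reducers run in parallel on different nodes. An iterative algorithm has to chain many such jobs, each reading from and writing to the file system, which is the weakness noted above.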
26. Batch processing frameworks (2/2)
! Spark (UC Berkeley)
• Support for iterative computations and interactive data-mining by
caching data in cluster RAM. Uses commodity machines
• Core abstraction: Resilient Distributed Datasets (RDDs) used as
variables in Spark programs. RDDs include lineage data for easy
recovery/reconstruction
• Up to ~20X speedup over Hadoop. Used by Quantifind, Conviva, …
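The idea of lineage-based recovery can be sketched with a toy class. This is not the Spark API: `ToyRDD` and `lose_cache` are hypothetical names, and everything runs in one process. Each dataset remembers its parent and the transformation that produced it, so a lost in-memory copy can be recomputed instead of being restored from a replica.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: stores lineage (parent + transformation)."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source    # base data, only set for the root dataset
        self._parent = parent    # lineage: the dataset this one was derived from
        self._fn = fn            # lineage: how to derive it from the parent
        self._cache = None       # in-memory copy, may be lost

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        """Return the data, recomputing it from lineage if the cache is empty."""
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._fn(self._parent.collect())
        return list(self._cache)

    def lose_cache(self):
        """Simulate losing the cached copy, e.g. when a node fails."""
        self._cache = None

base = ToyRDD(source=range(10))
squares = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = squares.collect()      # computed once and cached in memory
squares.lose_cache()           # cached copy is gone
recovered = squares.collect()  # rebuilt by replaying the lineage
print(first, recovered)
```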
Image courtesy Zaharia et al.: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
27. Graph processing frameworks
! Pregel (Google)/Apache Giraph
[Figure: one BSP superstep across VM1 and VM2: compute, communicate, barrier]
• Multiple instances of vertex-programs: user-defined functions running
at/on each vertex
• Bulk Synchronous Parallel (BSP) processing, e.g., used for PageRank
• Stateful in-memory computations. Fault-tolerance via checkpoints
• Runs on commodity hardware (racks with high intra-rack bandwidth)
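The compute / communicate / barrier cycle can be sketched with a single-process PageRank. This is an illustration of the BSP pattern, not the Pregel or Giraph API: the function name, the damping factor of 0.85 and the tiny example graph are assumptions, and every vertex is assumed to have at least one outgoing edge.

```python
def pagerank_bsp(graph, supersteps=30, damping=0.85):
    """Vertex-centric PageRank; `graph` maps each vertex to its out-neighbours."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(supersteps):
        # Compute and communicate: each vertex sends an equal share of its
        # current rank along every outgoing edge.
        inbox = {v: [] for v in graph}
        for vertex, out_edges in graph.items():
            for dst in out_edges:
                inbox[dst].append(rank[vertex] / len(out_edges))
        # Barrier: only after all messages are delivered does any vertex
        # update its state for the next superstep.
        rank = {v: (1.0 - damping) / n + damping * sum(inbox[v]) for v in graph}
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_bsp(graph)
print(ranks)
```

A real system partitions the vertices across machines, so the communicate step crosses the network, which is why intra-rack bandwidth matters.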
29. Real-time processing frameworks (stream-processing) 2/2
! Discretized Streams/D-Streams (UC Berkeley)
• Treat a streaming computation as a series of batch computations on
small time intervals. D-Stream = chain of RDDs
• Fault-tolerance without replication or upstream backup (buffering)
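Treating a stream as a series of small batches can be sketched as follows. This is a plain-Python illustration, not the D-Streams implementation: the helper names and the one-second interval are assumptions, and intervals with no events are simply skipped.

```python
from collections import Counter

def discretize(events, interval):
    """Group (timestamp, value) events into batches, one per time interval."""
    batches = {}
    for timestamp, value in events:
        batches.setdefault(int(timestamp // interval), []).append(value)
    return [batches[index] for index in sorted(batches)]

def count_batch(batch):
    """An ordinary batch computation, applied to one interval's data."""
    return Counter(batch)

events = [(0.2, "click"), (0.7, "view"), (1.1, "click"), (1.9, "click"), (2.5, "view")]

# One batch result per interval; the sequence of results is the output stream.
per_interval = [count_batch(batch) for batch in discretize(events, interval=1.0)]

# State carried across intervals is derived from the previous results.
running = Counter()
for counts in per_interval:
    running += counts

print(per_interval, running)
```

Because each interval's result is a deterministic function of its input batch, a lost result can be recomputed from that batch rather than kept in a replica.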
Image courtesy Zaharia et al.: https://www.usenix.org/sites/default/files/conference/protected-files/zaharia_hotcloud12_slides.pdf
30. Interactive processing frameworks 1/4
! Apache Hive (Facebook)
• Open-source data warehouse built on top of Hadoop. HiveQL
queries are compiled into MapReduce jobs. Expensive WHERE clauses
mean table scans, which mean high latency
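Why a WHERE clause turns into a table scan can be seen by writing a query as map and reduce phases by hand. The sketch below is conceptual and is not the plan Hive generates: the `visits` table, its columns and the function names are hypothetical. It mirrors a query of the form SELECT page, COUNT(*) FROM visits WHERE country = 'US' GROUP BY page.

```python
from collections import defaultdict

# Hypothetical table; in Hive the rows would live in files on HDFS.
visits = [
    {"page": "/home", "country": "US"},
    {"page": "/home", "country": "DE"},
    {"page": "/cart", "country": "US"},
    {"page": "/home", "country": "US"},
]

def map_phase(rows):
    """Evaluate the WHERE predicate and emit (group key, 1) for matching rows."""
    pairs, scanned = [], 0
    for row in rows:
        scanned += 1                     # the predicate is checked on every row
        if row["country"] == "US":
            pairs.append((row["page"], 1))
    return pairs, scanned

def reduce_phase(pairs):
    """GROUP BY page with COUNT(*): sum the ones emitted for each page."""
    counts = defaultdict(int)
    for page, one in pairs:
        counts[page] += one
    return dict(counts)

pairs, rows_scanned = map_phase(visits)
result = reduce_phase(pairs)
print(result, rows_scanned)
```

With no index to consult, the filter touches every row, so latency grows with table size even when few rows match.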
Image courtesy Cubrid: http://www.cubrid.org/blog/dev-platform/platforms-for-big-data/
32. Interactive processing frameworks 3/4
! Impala (Cloudera)
• Inspired by Dremel (Google). Key concepts: columnar-data storage
(Trevni), aggregation trees for distributed query evaluation
• Takes advantage of Hive tables. Uses memory as a cache for tables
• Does not use MapReduce to answer queries (unlike Hive).
• 3X - 90X faster than Hive
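An aggregation tree, in the Dremel sense, computes partial aggregates where the data lives and merges them level by level. The sketch below is a single-process illustration, not Impala's implementation: the function names and `fan_in` parameter are assumptions, the partition list must be non-empty, and `fan_in` must be at least 2.

```python
def leaf_aggregate(rows):
    """Partial aggregate computed next to the data: (sum, count)."""
    return (sum(rows), len(rows))

def merge(partials):
    """Intermediate node: combine the partial aggregates of its children."""
    return (sum(p[0] for p in partials), sum(p[1] for p in partials))

def aggregate_tree(partitions, fan_in=2):
    """Merge leaf results level by level until a single root result remains."""
    level = [leaf_aggregate(rows) for rows in partitions]
    while len(level) > 1:
        level = [merge(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    return level[0]

# Four data partitions, e.g. held by four different nodes
partitions = [[3, 5], [10], [2, 4, 6], [8, 2]]
total, count = aggregate_tree(partitions)
average = total / count
print(total, count, average)
```

Only small (sum, count) pairs travel up the tree, not the rows themselves, which keeps the root from becoming a bottleneck.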
Image courtesy Cloudera: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
33. Interactive processing frameworks 4/4
! Shark (UC Berkeley)
• Key concepts: columnar-data storage (in-memory), Directed Acyclic
Graphs of Tasks for distributed query optimization and evaluation,
dynamic mid-query replanning
• Uses Spark RDDs to store data and query processing results
• SQL-interface (HiveQL compatible)
• 100X faster than Hadoop, 100X faster than Hive
Image courtesy Xin et al.: http://shark.cs.berkeley.edu/presentations/2012-11-26-shark-tech-report.pdf
34. Unifying the Big Data Platform using Virtualization
! Goals
• Make it fast and easy to provision new data clusters on demand
• Allow mixing of workloads
• Leverage virtual machines to provide isolation (esp. for multi-tenancy)
• Optimize data performance based on virtual topologies
• Make the system reliable based on virtual topologies
! Leveraging Virtualization
• Elastic scale
• Use high-availability to protect key services, e.g., Hadoop’s namenode/job tracker
• Resource controls and sharing: re-use underutilized memory, CPU
• Prioritize workloads: limit or guarantee resource usage in a mixed environment
[Figure: cloud infrastructure (private, public)]
35. A Unified Analytics Cloud Significantly Simplifies
! Simplify
• Single Hardware Infrastructure
• Faster/Easier provisioning
! Optimize
• Shared Resources = higher utilization
• Elastic resources = faster on-demand access
[Figure: separate SQL, NoSQL, Hadoop and Decision Support clusters alongside a single Unified Analytics Infrastructure (Big SQL, NoSQL, Hadoop) on private and public cloud]
36. Simplify Heterogeneous Data Management via Data PaaS
[Diagram layers, top to bottom:
• Analytics Tools, Developer, Databases
• Data stores: File-system, Large-Scale NoSQL, In-Memory, Big SQL
• Data PaaS – Common Data Management Layer: Data Platform Provisioning, Multi-tenancy, Import/Export, Cloud Infrastructure Management, Data Discovery
• Cloud Infrastructure]
37. Technology: Databases and Data Stores for Big Data
The four categories run from unstructured (File-system) to structured (Big SQL).
File-system
• Types of data: log files, machine generated data, documents, device data, etc…
• Technologies: NAS, HDFS, Blob, S3, MAPR, etc.
• Values: store any data, easy to scale-out, can optimize for cost
Large-Scale NoSQL
• Types of data: loosely typed device data, records, events, statistics, complex relations/graphs
• Technologies: Cassandra, HBase, Voldemort
• Values: easy to scale-out, flexible and dynamic schemas
In-Memory
• Types of data: structured, partitionable data
• Technologies: Gemfire, Redis, Membase, SPARK
• Values: high throughput, low latency
Big SQL
• Types of data: structured data
• Technologies: HAWQ, Impala, Aster, …
• Values: high performance for repetitive queries, ease of query language
38. The Unified Analytics Cloud Platform
[Diagram layers, top to bottom:
• Analytics Tools: MADlib, Karmasphere, Datameer, Tableau
• Developer Frameworks: Hadoop, R, Python
• PaaS: Spring, Cloud Foundry
• Database/DataStore: Cassandra, HBase, HDFS, Greenplum, Voldemort
• Data Platform / Data PaaS: Data Director, EMC Chorus
• Cloud Infrastructure (private, public): vSphere]
39. Summary
! Revolution in Big Data is under way
• Data centric applications are now critical
! Hadoop on Virtualization
• Proven performance
• Cloud/Virtualization values apparent for Hadoop use
! Simplify through a Unified Analytics Cloud
• One Platform for today’s and future big-data systems
• Better Utilization
• Faster deployment, elastic resources
• Secure, Isolated, Multi-tenant capability for Analytics
40. References
! Twitter
• @richardmcdougll
! My CTO Blog
• http://communities.vmware.com/community/vmtn/cto/cloud
! Hadoop on vSphere
• Talk @ Hadoop World
• Performance Paper – http://www.vmware.com/files/.../VMW-Hadoop-Performance-vSphere5.pdf
! Spring Hadoop
• http://blog.springsource.org/2012/02/29/introducing-spring-hadoop