The business case for Hadoop can be made on the operational cost savings it affords. But why stop there? Integrating R-powered analytics into Hadoop presents a new value proposition: organizations can write R code and deploy it natively in Hadoop without moving data or writing their own MapReduce jobs. Bringing R-powered predictive analytics into Hadoop accelerates Hadoop’s value to organizations by allowing them to break through performance and scalability challenges and solve new analytic problems. Use all the data in Hadoop to discover more, grow more quickly, and operate more efficiently. Ask bigger questions. Ask new questions. Get better, faster results and share them.
3. Revolution Confidential
Today’s Challenge: Accelerating Business Cadence
Changing Business Environment
• Fact Based Decisions Require More Data
• Need to Understand Tradeoffs and Best Course of Action
• Predictive Models Need to Continually Deliver Lift
• Reduced Shelf Life for Predictive Models
Faster Time to Value
• Reduce Analytic Cycle Time
• Build & Deploy Models Faster
• Eliminate Time Consuming Data Movements
Rapid Customer Facing Decisions
• Score More Frequently
• Need to Make Best Decision in Real Time
5. Typical Technology Challenges Our Customers Face
Big Data
• New Data Sources
• Data Variety & Velocity
• Fine Grain Control
• Data Movement, Memory Limits
Complex Computation
• Experimentation
• Many Small Models
• Ensemble Models
• Simulation
Enterprise Readiness
• Heterogeneous Landscape
• Write Once, Deploy Anywhere
• Skill Shortage
• Production Support
Production Efficiency
• Shorter Model Shelf Life
• Volume of Models
• Long End-to-End Cycle Time
• Pace of Decision Accelerated
12. Big Data Big Analytics Use Cases
One Big Model
• Build predictive models with (very) large datasets
• More rows/observations and/or more columns/features
• Tend to use dimension reduction, machine learning and/or ensemble techniques
Big Data Scoring
• Score and predict with (very) large datasets using a previously built model
• Score in batch or individual transactions
• Previously built model may be exported from the model build environment to the model deployment environment
Many Small Models
• Model factories build predictive models in quantity
• Automated building of individualized models and/or parallel individualized model execution
Scoring Many Models
• Score and predict with many individualized models
• Production model factories require model management
Computationally Intensive Analytics
• Analytic models that are mathematically intense
• May not use large data sets but generate a lot of interim calculations
• May include vectorization, simulation, optimization
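The "Many Small Models" and "Scoring Many Models" use cases can be illustrated with a minimal Python sketch. It is a conceptual illustration only and not Revolution R Enterprise code: the segment names, the sample rows, and the helper functions `fit_line`, `build_model_factory` and `score` are all hypothetical. It fits one small least-squares line per segment and then scores against the model for a given segment.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def build_model_factory(rows):
    """Fit one small model per segment; rows are (segment, x, y) tuples."""
    by_segment = defaultdict(list)
    for segment, x, y in rows:
        by_segment[segment].append((x, y))
    return {segment: fit_line(points) for segment, points in by_segment.items()}


def score(models, segment, x):
    """Score a single observation with the model built for its segment."""
    intercept, slope = models[segment]
    return intercept + slope * x


# Hypothetical data: two segments, each with its own linear relationship.
rows = [
    ("store_a", 1.0, 3.0), ("store_a", 2.0, 5.0), ("store_a", 3.0, 7.0),
    ("store_b", 1.0, 10.0), ("store_b", 2.0, 9.0), ("store_b", 3.0, 8.0),
]
models = build_model_factory(rows)
prediction_a = score(models, "store_a", 4.0)
prediction_b = score(models, "store_b", 4.0)
```

Because each segment is fitted independently, the per-segment fits could be run in parallel, and the resulting dictionary of models is the thing that model management would need to version and track.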
13. Big Data Big Analytics Specialized Use Cases
Time Series Analytics
• Build forecasts with time sequenced data
• For Big Data, tend to be many small models, especially for machine data
• Typical Big Data volumes require model management
Text and Document Analytics
• Use of unstructured, free text
• For Big Data, typically used to enhance structured predictive analytics
• Minimally requires text processing tools and may also require natural language processing
Mining Data Streams
• Analyzing continuous, high speed data flows for patterns and acting upon the patterns in real-time
• Requires specialized sampling and filtering techniques
• Uses distinct discovery analytics methods such as frequent itemsets or clustering
Zero Latency
• No separation of model building and model scoring
• As real-time data becomes more widely available, this emerging category reduces time-to-insight with little or no separation between model building and scoring
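The slide does not say which sampling techniques it has in mind for data streams. One widely used option is reservoir sampling, which keeps a fixed-size uniform random sample of a stream whose length is not known in advance. The sketch below is an assumption about what such a technique could look like, not something taken from the deck; the function name `reservoir_sample` is illustrative.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (index + 1).
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                reservoir[slot] = item
    return reservoir


# Sample 5 readings from a simulated stream of 1,000 sensor readings.
sample = reservoir_sample(range(1000), 5, random.Random(0))
```

The sample uses constant memory regardless of stream length, which is why this family of techniques suits continuous, high speed data flows.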
14. Analytic Reference Architecture
[Diagram: layered analytic reference architecture]
Decision: Analytic Applications
Integration: Middleware
Data: Hadoop, Data Warehouse, Other Data Sources
Analytics: Analytics Development Tools & Platforms
15. Architectural Approaches to Analytics
[Diagram: Beside Architecture and Inside Architecture shown side by side]
Beside Architecture
Decision: Analytic Applications
Integration: Middleware
Analytics: Analytics Development Tools & Platforms, Local Data Mart
Data: Data Sources
Inside Architecture
Decision: Analytic Applications
Integration: Middleware
Data+Analytics: Analytics Development Tools & Platforms, Data Sources
16. Pros & Cons of Architectural Approaches
Beside Architecture
• Analytic workflow tasks performed in a separate analytics environment outside of the source database
• Pros: Segregates analytic workload
• Cons: Doesn’t leverage powerful production systems for transformations, introduces scoring latencies
Inside Architecture
• Analytic workflow tasks performed inside the source database with embedded analytics
• Pros: Eliminates data movement, reduces model latency, allows exploration of all data
• Cons: IT governance on production, potential new skills
Hybrid Architecture
• Some analytic workflow tasks performed inside the source database & others performed in a separate analytics environment
• Pros: Leverages strengths of each architecture
• Cons: Must maintain multiple environments
17. Building & Deploying Analytic Models
[Diagram: numbered model workflow steps for each architecture]
Beside Architecture: Analytics (Analytics Development Tools & Platforms, Local Data Mart) beside Data (Data Sources)
Inside Architecture: Data+Analytics (Analytics Development Tools & Platforms, Data Sources)
Hybrid Architecture: Analytics (Analytics Development Tools & Platforms, Local Data Mart) combined with Data+Analytics (Analytics Development Tools & Platforms, Data Sources)
Legend: Model Build, Model Deploy, Model Recode / PMML, Update Data, Data Prep / Marshaling
24. What is the R Language?
A Platform…
• A Procedural Language for Stats, Math and Data Science
• A Complete Data Visualization Framework
• Provided as Open Source
A Community…
• 2M+ Users with the Skill to Tackle Big Data Statistical and Numerical Analysis and Machine Learning Projects
• Active User Groups Across the World
An Ecosystem
• CRAN: 4500+ Freely Available Algorithms, Test Data and Evaluations
25. Revolution R Enterprise
Revolution R Enterprise is the only enterprise big data big analytics platform based on the open source R statistical computing language
• Portable Across Enterprise Platforms
• High Performance, Scalable Analytics
• Easier to Build & Deploy
26. R is open source and drives analytic innovation but… has some limitations for Enterprises
Area | Open Source R | Revolution R Enterprise
Big Data | In memory bound | Disk based scalability
Speed of Analysis | Single threaded | Parallel threading
Enterprise Readiness | Community support | Commercial support
Analytic Breadth & Depth | 4500+ innovative analytic packages | Leverage open source packages plus Big Data ready packages
Commercial Viability | Risk of deployment of open source | Commercial License
27. Introducing Revolution R Enterprise: The Big Data Big Analytics Platform
[Diagram: Revolution R Enterprise component stack]
Layers: Language Interpreter and Standard R Algorithm Suites; Development & Deployment Tooling; Big Data Distributed Execution Platform
Components: R+CRAN, RevoR, DistributedR, ConnectR, ScaleR, DevelopR, DeployR
28. Big Data Speed @ Scale with Revolution R Enterprise
• Fast Math Libraries
• Parallelized Algorithms
• In-Database Execution
• Multi-Threaded Execution
• Multi-Core Processing
• In-Hadoop Execution
• Memory Management
• Parallelized User Code
First, we enhance and accelerate the Open Source R interpreter.
29. Open Source R performance: Multi-threaded Math
Computation (4-core laptop) | Open Source R | Revolution R | Speedup
Linear Algebra (1)
Matrix Multiply | 176 sec | 9.3 sec | 18x
Cholesky Factorization | 25.5 sec | 1.3 sec | 19x
Linear Discriminant Analysis | 189 sec | 74 sec | 3x
General R Benchmarks (2)
R Benchmarks (Matrix Functions) | 22 sec | 3.5 sec | 5x
R Benchmarks (Program Control) | 5.6 sec | 5.4 sec | Not appreciable
1. http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php
2. http://r.research.att.com/benchmarks/
Customers report 3-50x performance improvements compared to Open Source R, without changing any code
30. Big Data Speed @ Scale with Revolution R Enterprise
• Fast Math Libraries
• Parallelized Algorithms
• In-Database Execution
• Multi-Threaded Execution
• Multi-Core Processing
• In-Hadoop Execution
• Memory Management
• Parallelized User Code
Second, we built a platform for hosting R with Big Data on a variety of massively parallel platforms.
31. Revolution R Enterprise DistributedR
Innovative Memory Management, Multi-Threaded Execution, Multi-Core Processing
• A Revolution R Enterprise ScaleR analytic is given a data source as input.
• The analytic loops over the data, reading one block at a time.
• Blocks of data are read by a separate worker thread (Thread 0).
• Worker threads (Threads 1..n) process the data block from the previous iteration of the data loop and update intermediate results objects in memory.
• When all of the data has been processed, a master results object is created from the intermediate results objects.
COMBINE INTERMEDIATE RESULTS
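The block-at-a-time pattern described above can be sketched in a few lines of Python. This is a conceptual illustration under stated assumptions, not how DistributedR is implemented: the function names (`read_blocks`, `process_block`, `combine`, `summarize`) are hypothetical, the "data source" is an in-memory list, and the sketch does not reproduce the dedicated reader thread that overlaps I/O with computation. It shows only the core idea that each block yields a small intermediate result and the final answer is built by combining those results.

```python
from concurrent.futures import ThreadPoolExecutor


def read_blocks(values, block_size):
    """Yield the data one block at a time, standing in for a chunked reader."""
    for start in range(0, len(values), block_size):
        yield values[start:start + block_size]


def process_block(block):
    """Compute an intermediate result: count, sum and sum of squares."""
    return (len(block), sum(block), sum(v * v for v in block))


def combine(partials):
    """Build the master result from the per-block intermediate results."""
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return {"n": n, "mean": mean, "variance": variance}


def summarize(values, block_size=4, workers=3):
    """Process blocks on worker threads, then combine intermediate results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_block, read_blocks(values, block_size)))
    return combine(partials)


result = summarize([float(v) for v in range(1, 11)])
```

Only the small intermediate tuples are held for the whole run, so memory use depends on the block size rather than on the total size of the data.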
33. SAS HPA Benchmarking comparison*
Logistic Regression | SAS HPA | Revolution R | Revolution R vs. SAS HPA
Rows of data | 1 billion | 1 billion |
Parameters | “just a few” | 7 | Double
Time | 80 seconds | 44 seconds | 45%
Data location | In memory | On disk |
Nodes | 32 | 5 | 1/6th
Cores | 384 | 20 | 5%
RAM | 1,536 GB | 80 GB | 5%
Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
*As published by SAS in HPC Wire, April 21, 2011
Revolution R Enterprise Delivers Performance at 2% of the Cost
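The resource ratios quoted in the comparison can be rechecked from the published figures with a few lines of arithmetic. The sketch below uses only the time, node, core and RAM numbers from the benchmark table. The dictionary names are illustrative, and the 2% cost figure is not derivable from these numbers, so it is not recomputed here.

```python
# Figures from the benchmark comparison (SAS HPA vs. Revolution R).
sas = {"seconds": 80, "nodes": 32, "cores": 384, "ram_gb": 1536}
rre = {"seconds": 44, "nodes": 5, "cores": 20, "ram_gb": 80}

# Revolution R figure as a fraction of the SAS HPA figure.
ratios = {key: rre[key] / sas[key] for key in sas}

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.1%} of the SAS HPA figure")

# Reduction in elapsed time, which is what the "45%" annotation reflects.
time_saved = 1 - ratios["seconds"]
```

Cores and RAM both come out at roughly 5% (about a 20th), nodes at roughly 16% (about a 6th), and elapsed time at 55% of the SAS HPA figure, a 45% reduction.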
34. Revolution R Enterprise ScaleR: High Performance Big Data Analytics
Data Prep, Distillation & Descriptive Analytics
R Data Step
• Data import: Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort
• Merge
• Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max
• Mean
• Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
35. Revolution R Enterprise ScaleR: High Performance Big Data Analytics
Statistical Modeling and Machine Learning
Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM): all exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions including cauchit, identity, log, logit, probit; user defined distributions & link functions
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models
Data Visualization
• Histogram
• Line Plot
• Scatter Plot
• Lorenz Curve
• ROC Curves (actual data and predicted values)
Cluster Analysis
• K-Means
Classification
• Decision Trees
Simulation
• Monte Carlo
Variable Selection
• Stepwise Regression (for linear reg)
36. Unparalleled Big Data Big Analytics Scale, Performance & Innovation
1 + 1 = 1000’s
[Diagram: Value plotted against Performance]
(Performance Enhanced R Language with Open Source R Analytic Packages) + (Big Data Distributed & Parallel Processing & Analytic Package) = Revolution R Enterprise
37. Leveraging CRAN with DistributedR & ScaleR
Big Data Distillation
Allows an R programmer to use RRE ScaleR to reduce dimensionality first and then feed the reduced data set into open source packages, so that the computationally intensive portion is sped up with RRE ScaleR techniques and any of the many open source packages can still be used
Big Data Threading
Allows an R programmer to use RRE ScaleR to execute algorithms designed for SMP environments in parallel using DistributedR (e.g., Monte Carlo simulation)
Supercharge Open Source Package with RRE
Allows an R programmer to re-engineer a CRAN routine by replacing an open source function inside an R-based algorithm with the equivalent ScaleR function(s)
High Performance Custom Algorithm
Allows an R programmer to use the RRE high throughput extreme data format (XDF) to apply any combination of open source functions and logic while chunking through an XDF file, overcoming the memory limitations of open source R
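The "Big Data Threading" pattern, running independent simulation tasks in parallel and combining their results, can be sketched generically in Python. This is an illustration of the idea only and does not use DistributedR or any Revolution R Enterprise API; the function names and the choice of estimating pi are assumptions made for the example. A thread pool keeps the sketch simple and portable; in CPython, a process pool would be the usual choice for genuine CPU parallelism.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def count_hits(task):
    """Run one independent simulation task with its own seeded generator."""
    seed, draws = task
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            hits += 1
    return hits


def estimate_pi(total_draws=200_000, tasks=8, workers=4):
    """Split the simulation into tasks, run them in parallel, combine counts."""
    draws_per_task = total_draws // tasks
    work = [(seed, draws_per_task) for seed in range(tasks)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = sum(pool.map(count_hits, work))
    return 4.0 * hits / (draws_per_task * tasks)


pi_estimate = estimate_pi()
```

Giving each task its own seed keeps the tasks independent and the run reproducible, which is what makes this kind of simulation easy to distribute.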
39. Big Analytics on Big Data in Hadoop
100% R on Hadoop
• Full Skill Transfer - No Java needed.
• Use 4500+ CRAN Packages
• Blend Combine R & Other Tools / Methods
100% Portability
• Build Once – Deploy Many
• Track Evolution of Hadoop
• Protect Against Platform Uncertainty
• Avoid Platform Lock-ins
Hadoop Performance & Scale
• Leverage Hadoop Parallelism Easily
• Analyze Data Without Moving It
[Diagram: Data, Analytics and Applications layers on Hadoop (HDFS, HBase, Hive) with Parallel Storage + Scalable Compute; callouts: 100% R, Portability, Big Data Scale]
40. Revolution R Enterprise + Cloudera Propels Enterprises into the Future
[Diagram: layered analytic reference architecture]
Decision: Analytic Applications
Integration: Middleware
Data: Cloudera Data Management Platform
Analytics: Revolution R Enterprise Big Data Big Analytics Platform
41. Revolution R Enterprise Powers Write Once, Deploy Anywhere
[Diagram: numbered model workflow steps for each architecture, with a Direct Connector between Revolution R Enterprise and Cloudera]
Beside Architecture: Analytics (Revolution R Enterprise, Local Data Mart) beside Data (Cloudera)
Inside Architecture: Data+Analytics (Revolution R Enterprise, Cloudera)
Hybrid Architecture: Analytics (Revolution R Enterprise, Local Data Mart) combined with Data+Analytics (Revolution R Enterprise, Cloudera)
Legend: Model Build, Model Deploy, Model Recode / PMML, Update Data, Data Prep / Marshaling
Bottom Line: Save Time, Save Money, Get Insights Faster
• Direct connectors access data without data movement
• Push-down execution analyzes data without moving it
• Use the same R script on any platform without recoding
• Use the right architecture for the job!
42. Revolution R Enterprise Inside Cloudera
[Diagram: data flows from data sources through Cloudera and Revolution R Enterprise to consumers]
Data Sources: Machine Data, New Data Sources, Data Suppliers, Traditional Sources, IBM Mainframe
Cloudera with Revolution R Enterprise (R+CRAN, RevoR, DistributedR, ConnectR, ScaleR, DeployR): Big Data Big Analytics; Data Transformation, Model Building & Scoring
Consumption
• Business Analysts (Alteryx, Tableau, QlikView, Cognos, Microstrategy, Datameer etc.)
• Power Analysts (R Studio, DevelopR, etc.)
• Line of Business users (Analytic Apps, Rules Engines, etc.)
43. QuickStart Programs Deliver Value Quickly
• Offered by both Cloudera and Revolution Analytics
• Combine Software, Services and Training
• Cloudera can help you get started with Hadoop in a few ways
• Revolution Analytics helps you realize value from R + Hadoop
44. Summary
Revolution R Enterprise and Cloudera Hadoop bring together best-of-breed technologies to deliver:
• Highly scalable, high performance machine learning on data residing in Hadoop
• Analytics at scale that are accessible and easy for R users, through the familiar R programming environment
• Full lifecycle analytics, from ad-hoc analysis to production analytics, in one managed environment that integrates disparate data sources in one repository
• A seamless operational experience for managing both products, through the deep integration of Revolution R Enterprise with Cloudera