Fully featured, commercially supported machine learning suites that can build Decision Trees in Hadoop are few and far between. Addressing this gap, Revolution Analytics recently enhanced its entire scalable analytics suite to run in Hadoop. In this talk, I will explain how our Decision Tree implementation exploits recent research reducing the computational complexity of decision tree estimation, allowing linear scalability with data size and number of nodes. This streaming algorithm processes data in chunks, allowing scaling unconstrained by aggregate cluster memory. The implementation supports both classification and regression and is fully integrated with the R statistical language and the rest of our advanced analytics and machine learning algorithms, as well as our interactive Decision Tree visualizer.
Decision Trees in Hadoop
1. Decision Trees built in Hadoop
plus more Big Data Analytics
with Revolution R Enterprise
All Rights Reserved, Revolution Analytics 2014
Mario Inchiosa, US Chief Scientist
mario.inchiosa@revolutionanalytics.com
Revolution Webinar – April 17, 2014
2.
OUR COMPANY
The leading provider of advanced analytics software and services based on open source R, since 2007
OUR SOFTWARE
The only Big Data, Big Analytics software platform based on the data science language R
SOME KUDOS
Visionary, Gartner Magic Quadrant for Advanced Analytics Platforms, 2014
3. Typical Challenges our Customers Face
Big Data
• Many new data sources
• Data variety & velocity
• Data movement, memory limits
Production Efficiency
• Shorter model shelf life
• Volume of Models
• Long end-to-end cycle time
• Pace of decision accelerated
Enterprise Readiness
• Heterogeneous landscape
• Write once, deploy anywhere
• Skill shortage
• Production support
Complex Computation
• Mathematically sophisticated
• Parallelization
• Experimentation
• Ensemble models
• Many small models
• Simulation
4. Polling Question:
What is your current analytics software platform?
– Please select one
• R/RRE
• SAS
• SPSS
• Tibco/Spotfire
• KXEN
• Other
6. What is R?
Most widely used data analysis software
• Used by 2M+ data scientists, statisticians and analysts
Most powerful statistical programming language
• Flexible, extensible and comprehensive for productivity
Create beautiful and unique data visualizations
• As seen in New York Times, Twitter and Flowing Data
Thriving open-source community
• Leading edge of analytics research
Fills the talent gap
• New graduates prefer R
R is Hot
bit.ly/r-is-hot
WHITE PAPER
7. Exploding growth and demand for R
R is the highest paid IT skill
– Dice.com, Jan 2014
R most-used data science language after SQL
– O’Reilly, Jan 2014
R is used by 70% of data miners
– Rexer, Sep 2013
R is #15 of all programming languages
– RedMonk, Jan 2014
R growing faster than any other data science language
– KDnuggets, Aug 2013
More than 2 million users worldwide
R Usage Growth
Rexer Data Miner Survey, 2007-2013
70% of data miners report using R
R is the first choice of more
data miners than any other
software
Source: www.rexeranalytics.com
9. Revolution R Enterprise is… the only big data big analytics platform based on open source R
High Performance, Scalable Analytics
Portable Across Enterprise Platforms
Easier to Build & Deploy Analytics
10. R is open source and drives analytic innovation but… has some limitations for Enterprises
Big Data | In-memory bound | Hybrid memory & disk scalability | Operates on bigger volumes & factors
Speed of Analysis | Single threaded | Parallel threading | Shrinks analysis time
Enterprise Readiness | Community support | Commercial support | Delivers full service production support
Analytic Breadth & Depth | 5000+ innovative analytic packages | Leverage open source packages plus Big Data ready packages | Supercharges R
11. All of Open Source R plus:
Big Data scalability
High-performance analytics
Development and deployment tools
Data source connectivity
Application integration framework
Multi-platform architecture
Support, Training and Services
is the
Big Data Big Analytics Platform
12. R+CRAN
• Open source R interpreter
• UPDATED R 3.0.2
• Freely-available R algorithms
• Algorithms callable by RevoR
• Embeddable in R scripts
• 100% Compatible with existing
R scripts, functions and
packages
RevoR
• Performance enhanced R interpreter
• Based on open source R
• Adds high-performance math
Available On:
• IBM® Platform LSF™ Clusters
• Microsoft® HPC Clusters
• Windows® & Linux Servers
• Windows® & Linux Workstations
• NEW Cloudera® Hadoop
• NEW Hortonworks® Hadoop
• NEW Teradata® Database
• Intel® Hadoop
• IBM® BigInsights™
• IBM® PureData™ for Analytics, powered by Netezza technology
The Platform Step by Step:
R Capabilities
13. Revolution R Enterprise RevoR: Performance Enhanced R
Computation (4-core laptop) | Open Source R | Revolution R | Speedup
Linear Algebra [1]
Matrix Multiply | 176 sec | 9.3 sec | 18x
Cholesky Factorization | 25.5 sec | 1.3 sec | 19x
Linear Discriminant Analysis | 189 sec | 74 sec | 3x
General R Benchmarks [2]
R Benchmarks (Matrix Functions) | 22 sec | 3.5 sec | 5x
R Benchmarks (Program Control) | 5.6 sec | 5.4 sec | Not appreciable
[1] http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php
[2] http://r.research.att.com/benchmarks/
Customers report 3-50x performance improvements compared to Open Source R, without changing any code
14. DistributedR
• Distributed computing framework
• Delivers portability across platforms
ConnectR
• High-speed data import/export
Available for:
• High-performance XDF
• SAS, SPSS, delimited & fixed format
text data files
• Hadoop HDFS (text & XDF)
• Teradata Database TPT
• ODBC (incl. Vertica, Oracle, Pivotal,
Aster, SybaseIQ, DB2, MySQL)
ScaleR
• Ready-to-Use high-performance
big data big analytics
• Fully-parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical
tests
• Correlation & covariance matrices
• Predictive Models – linear, logistic,
GLM
• Machine learning
• Monte Carlo simulation
• NEW Tools for distributing
customized algorithms across nodes
DistributedR available on:
• Windows Servers
• Red Hat and NEW SuSE Linux Servers
• IBM Platform LSF Linux Clusters
• Microsoft HPC Clusters
• NEW Cloudera Hadoop
• NEW Hortonworks Hadoop
• NEW Teradata Database
The Platform Step by Step:
Parallelization & Data Sourcing
15. DeployR
• Web services software development kit for integrating analytics via Java, JavaScript or .NET APIs
• Integrates R into application infrastructures
Capabilities:
• Invokes R Scripts from
web services calls
• RESTful interface for
easy integration
• Works with web & mobile apps,
leading BI & Visualization tools and
business rules engines
DevelopR
• Integrated development
environment for R
• Visual ‘step-into’ debugger
Available on:
• Windows
DevelopR DeployR
The Platform Step by Step:
Tools & Deployment
16. Scalable and Parallelized across Cores and Nodes
[Diagram: BIG DATA is split into DATA PARTITIONs; each COMPUTE NODE (a multicore processor with 4, 8, 16+ cores, one thread per core, and shared memory) evaluates its partition; the MASTER NODE combines the intermediate results.]
17. ScaleR Scalability and Performance
Handles an arbitrarily large number of rows in a fixed amount of memory
Scales linearly with the number of rows
Scales linearly with the number of nodes
Scales well with the number of cores per node
Scales well with the number of parameters
Extremely high performance
18. Unique PEMAs: Parallel,
external-memory algorithms
High-performance, scalable
replacements for R/SAS
analytic functions
Parallel/distributed
processing eliminates CPU
bottleneck
Data streaming eliminates
memory size limitations
Works with in-memory and
disk-based architectures
Eliminates Performance and Capacity
Limits of Open Source R and Legacy SAS
20. Revolution R Enterprise ScaleR Outperforms SAS HPA at a Fraction of the Cost
Logistic Regression:
 | SAS HPA* | Revolution R
Rows of data | 1 billion | 1 billion
Parameters | “just a few” | 7
Time | 80 seconds | 44 seconds
Data location | In memory | On disk
Nodes | 32 | 5
Cores | 384 | 20
RAM | 1,536 GB | 80 GB
Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
*As published by SAS in HPCwire, April 21, 2011
See Revolution white paper for additional benchmarks.
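The resource ratios quoted on this slide can be checked with a few lines of base R. This is only a worked-arithmetic sketch: the vector names `sas` and `revo` are illustrative, and the figures are copied from the slide.

```r
# Benchmark figures as reported on the slide (SAS HPA vs. Revolution R).
sas  <- c(nodes = 32, cores = 384, ramGB = 1536, seconds = 80)
revo <- c(nodes = 5,  cores = 20,  ramGB = 80,   seconds = 44)

# How many times more resources (and time) the SAS HPA run used.
ratios <- round(sas / revo, 1)
print(ratios)
```

The node ratio comes out near 6 and the core and RAM ratios near 20, which is what the slide rounds to "a 6th" and "a 20th".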
21. Specific speed-related factors
Efficient computational algorithms
Efficient memory management – minimize data copying and data
conversion
Heavy use of C++ templates; optimal code
Efficient data file format; fast access by row and column
Models are pre-analyzed to detect and remove duplicate computations
and points of failure (singularities)
Handle categorical variables efficiently
Revolution R Enterprise 21
22. Scalability and portability of Revolution Analytics
“Parallel External Memory Algorithms” (PEMAs)
Anatomy of a PEMA: 1) Initialize, 2) Process Chunk,
3) Aggregate, 4) Finalize
Process a chunk of data at a time, giving linear scalability
Process an unlimited number of rows of data in a fixed
amount of RAM
Independent of the “compute context” (number of cores,
computers, distributed computing platform), giving portability
across these dimensions
Independent of where the data is coming from, giving
portability with respect to data sources
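The four PEMA steps can be illustrated with a minimal base R sketch that computes a mean and variance one chunk at a time. The function names (`pemaInit`, `pemaProcessChunk`, `pemaAggregate`, `pemaFinalize`) are hypothetical and are not part of ScaleR. The sum-of-squares formula is the simplest one that works, not a numerically robust choice.

```r
# 1) Initialize: an empty, fixed-size summary.
pemaInit <- function() list(n = 0, sum = 0, sumSq = 0)

# 2) Process Chunk: reduce one chunk of rows to a small summary.
pemaProcessChunk <- function(chunk) {
  list(n = length(chunk), sum = sum(chunk), sumSq = sum(chunk^2))
}

# 3) Aggregate: combine two summaries; the order of chunks does not matter.
pemaAggregate <- function(a, b) {
  list(n = a$n + b$n, sum = a$sum + b$sum, sumSq = a$sumSq + b$sumSq)
}

# 4) Finalize: turn the combined summary into the statistics of interest.
pemaFinalize <- function(s) {
  m <- s$sum / s$n
  list(mean = m, var = (s$sumSq - s$n * m^2) / (s$n - 1))
}

set.seed(1)
x <- rnorm(10000, mean = 5, sd = 2)
chunks <- split(x, ceiling(seq_along(x) / 1000))  # ten chunks of 1,000 rows

state <- pemaInit()
for (chunk in chunks) {
  state <- pemaAggregate(state, pemaProcessChunk(chunk))
}
result <- pemaFinalize(state)
```

Only the small summary is held between chunks, so memory use does not grow with the number of rows. Because aggregation is order-independent, the chunks could equally be processed on different cores or nodes and combined afterwards.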
23. Simplified ScaleR Internal Architecture
Analytics Engine
PEMAs are implemented here
(Scalable, Parallelized, Threaded, Distributable)
Inter-process Communication
MPI, RPC, Sockets, Files, UDFs
Data Sources
HDFS, Teradata, ODBC, SAS, SPSS,
CSV, Fixed, XDF
24. DistributedR
ScaleR
ConnectR
DeployR
DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE
In the Cloud Amazon AWS
Workstations & Servers Windows
Red Hat and SUSE Linux
Clustered Systems IBM Platform LSF
Microsoft HPC
EDW Teradata
IBM PureData™ for Analytics
Hadoop Cloudera
Hortonworks
Write Once.
Deploy Anywhere.
25. Decision Trees
– Easy-to-interpret models
– Widely used in a variety of disciplines, for example:
• Predicting which patient characteristics are associated with high risk of a condition such as heart attack
• Deciding whether or not to offer a loan to an individual based on individual characteristics
• Predicting the rate of return of various investment strategies
• Retail target marketing
– Can handle multi-level factor responses easily
– Useful in identifying important interactions
26. Decision Tree Types
Classification tree: predict what ‘class’ or ‘group’ an
observation belongs to (dependent variable is a factor)
Regression tree: predict the value of a continuous
dependent variable
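The difference between the two tree types shows up in what a terminal node predicts. The base R sketch below uses a hypothetical helper, `leafPrediction`, which is not an rxDTree function. It returns the majority class for a factor response and the mean for a continuous one.

```r
# Hypothetical helper: the prediction made by a terminal node, given the
# responses of the training observations that fall into that node.
leafPrediction <- function(y) {
  if (is.factor(y)) {
    names(which.max(table(y)))  # classification tree: most frequent class
  } else {
    mean(y)                     # regression tree: average response
  }
}

classPred <- leafPrediction(factor(c("Email", "Phone", "Email")))  # "Email"
valuePred <- leafPrediction(c(12, 15, 9))                          # 12
```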
27. Polling Question:
What is your “go-to” tree algorithm for predictions in your work?
– Please select one answer
• Single Trees
• Random Forests
• It Depends
• Neither
28. Classification Example: Marketing Response
Data set containing the following information:
Response: Did the customer respond to a phone call, an email, or a mailing?
Age
Income
Marital status
Attended college?
29. Estimating the model
treeOut <- rxDTree(response ~ age +
income + college + marital,
data = rdata)
30. Simple Example: Text Output
– Information on the split, the number of observations in the node,
the “loss”, the predicted value, and the probabilities
 1) root 10000 4069 Email (0.33260000 0.59310000 0.07430000)
   2) college=No College 5074 2378 Phone (0.53133622 0.38943634 0.07922743)
     4) age>=39.5 2518 330 Phone (0.86894361 0.00000000 0.13105639)
       8) age< 64.5 2256 77 Phone (0.96586879 0.00000000 0.03413121) *
       9) age>=64.5 262 9 Mail (0.03435115 0.00000000 0.96564885) *
     5) age< 39.5 2556 580 Email (0.19874804 0.77308294 0.02816901)
      10) marital=Single 835 371 Phone (0.55568862 0.40958084 0.03473054)
        20) income>=29.5 472 14 Phone (0.97033898 0.00000000 0.02966102) *
        21) income< 29.5 363 21 Email (0.01652893 0.94214876 0.04132231) *
      11) marital=Married 1721 87 Email (0.02556653 0.94944800 0.02498547) *
   3) college=College 4926 971 Email (0.12789281 0.80288266 0.06922452) …
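To read one line of this output, the "loss" and the predicted value can be reproduced from the node size and the class probabilities. The sketch below is base R with a hypothetical helper, `nodeSummary`. The class order (Phone, Email, Mail) is inferred from which probability is largest in nodes predicted as each class; it is not stated on the slide.

```r
# Hypothetical helper: predicted class and loss (number of observations that
# do not belong to the predicted class) for one node.
nodeSummary <- function(n, probs) {
  list(predicted = names(which.max(probs)),
       loss      = round(n * (1 - max(probs))))
}

# Node 1 (root): 10000 observations.
root <- nodeSummary(10000, c(Phone = 0.3326, Email = 0.5931, Mail = 0.0743))

# Node 2 (college=No College): 5074 observations.
node2 <- nodeSummary(5074, c(Phone = 0.53133622, Email = 0.38943634,
                             Mail = 0.07922743))
```

The root comes out as Email with a loss of 4069, and node 2 as Phone with a loss of 2378, matching the printed lines.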
32. The ‘Big Data’ Decision Tree Algorithm
Classical algorithms for building a decision tree sort all
continuous variables in order to decide where to split the data.
This sorting step becomes prohibitive when dealing with large
data.
rxDTree bins the data rather than sorting, computing histograms
to create empirical distribution functions of the data
rxDTree partitions the data “horizontally”, processing in parallel
different subsets of the observations
The accuracy of the parallel tree approximately equals that of the
serial tree (Ben-Haim & Tom-Tov, 2010)
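The binning idea can be sketched in base R for a single numeric predictor and a regression response. This is a simplified illustration, not the rxDTree implementation. It assumes fixed, equal-width bin edges known in advance, whereas Ben-Haim & Tom-Tov describe adaptive histograms. The helpers `binSummary` and `bestSplit` are hypothetical names.

```r
# Per-partition histogram: count and sum of the response in each bin.
binSummary <- function(x, y, breaks) {
  nBins <- length(breaks) - 1
  bin <- findInterval(x, breaks, rightmost.closed = TRUE)
  cbind(n    = tabulate(bin, nbins = nBins),
        sumY = vapply(seq_len(nBins), function(k) sum(y[bin == k]), numeric(1)))
}

# Pick the bin edge that most reduces the squared error, using only the
# merged histogram (no sorting of the raw data).
bestSplit <- function(h, breaks) {
  k  <- seq_len(nrow(h) - 1)                 # candidate interior bin edges
  nL <- cumsum(h[, "n"])[k]
  sL <- cumsum(h[, "sumY"])[k]
  nT <- sum(h[, "n"]); sT <- sum(h[, "sumY"])
  nR <- nT - nL;       sR <- sT - sL
  gain <- sL^2 / nL + sR^2 / nR - sT^2 / nT  # reduction in sum of squares
  gain[nL == 0 | nR == 0] <- -Inf            # ignore splits with an empty side
  best <- which.max(gain)
  list(threshold = breaks[best + 1], gain = gain[best])
}

set.seed(42)
x <- runif(20000, 0, 100)
y <- ifelse(x < 40, 10, 20) + rnorm(20000)
breaks <- seq(0, 100, by = 5)

# "Horizontal" partitioning: four subsets of the rows, summarized
# independently and then merged by simple addition.
parts  <- split(seq_along(x), rep(1:4, length.out = length(x)))
hists  <- lapply(parts, function(i) binSummary(x[i], y[i], breaks))
merged <- Reduce(`+`, hists)

split1 <- bestSplit(merged, breaks)
```

Each partition contributes only a small table, so the merge step costs the same regardless of how many rows were processed. In this sketch the chosen threshold can only fall on a bin edge, which is the approximation the binning approach accepts in exchange for avoiding the sort.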
33. Useful rxDTree Arguments for Big Data
maxDepth: maximum tree depth
minBucket: minimum number of observations in a terminal node
minSplit: minimum number of observations needed to split
cp: minimum fit improvement needed to accept a split
maxNumBins: maximum number of bins used to bin numeric data
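How the first four arguments interact can be sketched as a single check applied to each candidate split. The helper `splitAllowed` is hypothetical. It assumes rpart-like semantics, with `cp` compared against a relative improvement in fit, and the settings shown are illustrative values, not rxDTree defaults.

```r
# Hypothetical helper; argument names mirror the rxDTree arguments above.
splitAllowed <- function(depth, nParent, nLeft, nRight, relImprovement,
                         maxDepth, minSplit, minBucket, cp) {
  depth < maxDepth &&                   # node is not yet at the maximum depth
    nParent >= minSplit &&              # enough observations to attempt a split
    min(nLeft, nRight) >= minBucket &&  # both children are large enough
    relImprovement >= cp                # fit improves by at least cp
}

# Illustrative settings only.
ctrl <- list(maxDepth = 5, minSplit = 200, minBucket = 50, cp = 0.001)

# A balanced split that passes every rule.
ok <- do.call(splitAllowed,
              c(list(depth = 2, nParent = 1000, nLeft = 400, nRight = 600,
                     relImprovement = 0.02), ctrl))

# The same node, but one child would fall below minBucket.
bad <- do.call(splitAllowed,
               c(list(depth = 2, nParent = 1000, nLeft = 30, nRight = 970,
                      relImprovement = 0.02), ctrl))
```

On large data, tighter values for these limits mean fewer nodes to evaluate on each pass over the data.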
34. Related Decision Tree Functions
prune.rxDTree – model simplification
rxDTreeBestCp – optimizes pruning
rxAddInheritance – makes rxDTree compatible with rpart for printing
and plotting
createTreeView – interactive HTML graphics
rxPredict.rxDTree – scores new data using rxDTree model
rxDForest – Ensembles of Decision Trees – big data alternative to
randomForest package
rxVarImpPlot – plots variable importance as measured by rxDForest
rxPredict.rxDForest – scores new data using rxDForest model
35. Sample code for Decision Trees on workstation
# Specify local data source
airData <- myLocalDataSource
# Specify model formula and parameters
rxDTree( ArrDelay ~ Origin + Year + Month + DayOfWeek
+ UniqueCarrier + CRSDepTime, data=airData )
36. Sample code for Decision Trees on Hadoop
# Change the “compute context”
rxSetComputeContext(myHadoopCluster)
# Change the data source if necessary
airData <- myHadoopDataSource
# Otherwise, the code is the same
rxDTree( ArrDelay ~ Origin + Year +
Month + DayOfWeek + UniqueCarrier +
CRSDepTime, data=airData )
37. Write Once Deploy Anywhere
Set the desired compute context for data analytics execution…..
rxSetComputeContext("local") # DEFAULT: Local System
rxSetComputeContext(RxHadoopMR(<data, server environment arguments>))
rxSetComputeContext(RxHpcServer(<data, server environment arguments>))
rxSetComputeContext(RxLsfCluster(<data, server environment arguments>))
rxSetComputeContext(RxInTeradata(<data, server environment arguments>))
Same code to be run anywhere …..
# Summarize and calculate descriptive statistics
adsSummary <- rxSummary(~ArrDelay+CRSDepTime+DayOfWeek, data = airDS)
# Fit Linear Model
arrDelayLm1 <- rxLinMod(ArrDelay ~ DayOfWeek, data = airDS)
38. Polling Question:
What platforms are you most interested in running tree models on your
data?
– Please select all that apply
• Server
• Grid
• Hadoop
• Teradata
• Other
39. Revolution R Enterprise ScaleR: High Performance Big Data Analytics
Data Prep, Distillation & Descriptive Analytics
R Data Step
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation using any R functions and packages
• Recode variables
• Factor variables
• Missing value handling
• Sort
• Merge
• Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max
• Mean
• Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
40. Revolution R Enterprise ScaleR (continued)
Statistical Modeling
Predictive Models
• Covariance/Correlation/Sum of Squares/Cross-product Matrix
• Multiple Linear Regression
• Logistic Regression
• Generalized Linear Models (GLM)
- All exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions including: cauchit, identity, log, logit, probit.
- User defined distributions & link functions.
• Classification & Regression Trees and Forests
• Residuals for all models
Data Visualization
• Histogram
• ROC Curves (actual data and predicted values)
• Lorenz Curve
• Line and Scatter Plots
• NEW Tree Visualization
Variable Selection
• Stepwise Regression: Linear, NEW logistic, NEW GLM
Machine Learning
Cluster Analysis
• K-Means
Classification
• Decision Trees
• NEW Decision Forests
Simulation and HPC
• Monte Carlo
• Run open source R functions and packages across cores and nodes
Deployment
• Prediction (scoring)
• NEW PMML Export
41. Resources For You
Big Data Decision Trees with R
– http://www.revolutionanalytics.com/whitepaper/big-data-decision-trees-r
Advanced, Big Data Analytics with R and Hadoop
– http://www.revolutionanalytics.com/whitepaper/advanced-big-data-analytics-r-and-hadoop
Revolution R Enterprise: Faster than SAS
– http://www.revolutionanalytics.com/whitepaper/revolution-r-enterprise-faster-sas
May 13, 2014 – Webinar presenting the results of our RRE vs. SAS
benchmarking
To Get RRE for yourself, please visit:
– http://www.revolutionanalytics.com/get-revolution-r-enterprise
43. Thank you
Revolution Analytics is the leading commercial
provider of software and support for the
popular open source R statistics language.
mario.inchiosa@revolutionanalytics.com
www.revolutionanalytics.com, 1.855.GET.REVO, Twitter: @RevolutionR