Decision Trees built in Hadoop
plus more Big Data Analytics
with Revolution R Enterprise
All Rights Reserved, Revolution Analytics 2014
Mario Inchiosa, US Chief Scientist
mario.inchiosa@revolutionanalytics.com
Revolution Webinar – April 17, 2014
OUR COMPANY
The leading provider of advanced analytics software and services based on open source R, since 2007
OUR SOFTWARE
The only Big Data, Big Analytics software platform based on the data science language R
SOME KUDOS
Visionary, Gartner Magic Quadrant for Advanced Analytics Platforms, 2014
Typical Challenges our Customers Face
Big Data
• Many new data sources
• Data variety & velocity
• Data movement, memory limits
Production Efficiency
• Shorter model shelf life
• Volume of models
• Long end-to-end cycle time
• Pace of decision accelerated
Enterprise Readiness
• Heterogeneous landscape
• Write once, deploy anywhere
• Skill shortage
• Production support
Complex Computation
• Mathematically sophisticated
• Parallelization
• Experimentation
• Ensemble models
• Many small models
• Simulation
Polling Question:
• What is your current analytics software platform? (Please select one)
  – R/RRE
  – SAS
  – SPSS
  – Tibco/Spotfire
  – KXEN
  – Other
OPEN SOURCE R
What is R?
• Most widely used data analysis software
  – Used by 2M+ data scientists, statisticians and analysts
• Most powerful statistical programming language
  – Flexible, extensible and comprehensive for productivity
• Create beautiful and unique data visualizations
  – As seen in New York Times, Twitter and Flowing Data
• Thriving open-source community
  – Leading edge of analytics research
• Fills the talent gap
  – New graduates prefer R
R is Hot
bit.ly/r-is-hot
WHITE PAPER
Exploding growth and demand for R
• R is the highest paid IT skill
  – Dice.com, Jan 2014
• R most-used data science language after SQL
  – O’Reilly, Jan 2014
• R is used by 70% of data miners
  – Rexer, Sep 2013
• R is #15 of all programming languages
  – RedMonk, Jan 2014
• R growing faster than any other data science language
  – KDnuggets, Aug 2013
• More than 2 million users worldwide
R Usage Growth
Rexer Data Miner Survey, 2007-2013
70% of data miners report using R
R is the first choice of more
data miners than any other
software
Source: www.rexeranalytics.com
REVOLUTION R
ENTERPRISE
THE BIG DATA BIG
ANALYTICS
PLATFORM
Revolution R Enterprise is the only big data big analytics platform based on open source R
• High Performance, Scalable Analytics
• Portable Across Enterprise Platforms
• Easier to Build & Deploy Analytics
R is open source and drives analytic innovation, but has some limitations for enterprises
Big Data: In-memory bound | Hybrid memory & disk scalability | Operates on bigger volumes & factors
Speed of Analysis: Single threaded | Parallel threading | Shrinks analysis time
Enterprise Readiness: Community support | Commercial support | Delivers full service production support
Analytic Breadth & Depth: 5000+ innovative analytic packages | Leverage open source packages plus Big Data ready packages | Supercharges R
All of Open Source R plus:
• Big Data scalability
• High-performance analytics
• Development and deployment tools
• Data source connectivity
• Application integration framework
• Multi-platform architecture
• Support, Training and Services
is the
Big Data Big Analytics Platform
R+CRAN
• Open source R interpreter
• UPDATED R 3.0.2
• Freely-available R algorithms
• Algorithms callable by RevoR
• Embeddable in R scripts
• 100% Compatible with existing R scripts, functions and packages
RevoR
• Performance enhanced R interpreter
• Based on open source R
• Adds high-performance math
Available On:
• IBM® Platform LSF™ Clusters
• Microsoft® HPC Clusters
• Windows® & Linux Servers
• Windows® & Linux Workstations
• NEW Cloudera® Hadoop
• NEW Hortonworks® Hadoop
• NEW Teradata® Database
• Intel® Hadoop
• IBM® BigInsights™
• IBM® PureData™ for Analytics, powered by Netezza technology
The Platform Step by Step:
R Capabilities
Revolution R Enterprise RevoR: Performance Enhanced R
Computation (4-core laptop)         Open Source R   Revolution R   Speedup
Linear Algebra (1)
  Matrix Multiply                   176 sec         9.3 sec        18x
  Cholesky Factorization            25.5 sec        1.3 sec        19x
  Linear Discriminant Analysis      189 sec         74 sec         3x
General R Benchmarks (2)
  R Benchmarks (Matrix Functions)   22 sec          3.5 sec        5x
  R Benchmarks (Program Control)    5.6 sec         5.4 sec        Not appreciable
1. http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php
2. http://r.research.att.com/benchmarks/
Customers report 3-50x performance improvements compared to Open Source R, without changing any code
DistributedR
• Distributed computing framework
• Delivers portability across platforms
ConnectR
• High-speed data import/export
Available for:
• High-performance XDF
• SAS, SPSS, delimited & fixed format text data files
• Hadoop HDFS (text & XDF)
• Teradata Database TPT
• ODBC (incl. Vertica, Oracle, Pivotal, Aster, SybaseIQ, DB2, MySQL)
ScaleR
• Ready-to-use high-performance big data big analytics
• Fully-parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Correlation & covariance matrices
• Predictive models: linear, logistic, GLM
• Machine learning
• Monte Carlo simulation
• NEW Tools for distributing customized algorithms across nodes
DistributedR available on:
• Windows Servers
• Red Hat and NEW SuSE Linux Servers
• IBM Platform LSF Linux Clusters
• Microsoft HPC Clusters
• NEW Cloudera Hadoop
• NEW Hortonworks Hadoop
• NEW Teradata Database
The Platform Step by Step:
Parallelization & Data Sourcing
DeployR
• Web services software development kit for integrating analytics via Java, JavaScript or .NET APIs
• Integrates R into application infrastructures
Capabilities:
• Invokes R scripts from web services calls
• RESTful interface for easy integration
• Works with web & mobile apps, leading BI & visualization tools and business rules engines
DevelopR
• Integrated development environment for R
• Visual ‘step-into’ debugger
Available on:
• Windows
The Platform Step by Step:
Tools & Deployment
Scalable and Parallelized across Cores and Nodes
[Diagram: BIG DATA is split into data partitions that are evaluated on compute nodes. Within each node, a multicore processor (4, 8, 16+ cores) runs threads 0 to N against shared memory. The master node combines the intermediate results.]
ScaleR Scalability and Performance
• Handles an arbitrarily large number of rows in a fixed amount of memory
• Scales linearly with the number of rows
• Scales linearly with the number of nodes
• Scales well with the number of cores per node
• Scales well with the number of parameters
• Extremely high performance
Eliminates Performance and Capacity Limits of Open Source R and Legacy SAS
• Unique PEMAs: Parallel, external-memory algorithms
• High-performance, scalable replacements for R/SAS analytic functions
• Parallel/distributed processing eliminates CPU bottleneck
• Data streaming eliminates memory size limitations
• Works with in-memory and disk-based architectures
Revolution R Enterprise ScaleR Outperforms SAS HPA at a Fraction of the Cost
Logistic Regression:
                 SAS HPA*         Revolution R Enterprise
Rows of data     1 billion        1 billion
Parameters       “just a few”     7
Time             80 seconds       44 seconds
Data location    In memory        On disk
Nodes            32               5
Cores            384              20
RAM              1,536 GB         80 GB
Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
*As published by SAS in HPCwire, April 21, 2011
See Revolution white paper for additional benchmarks.
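The resource ratios quoted in this comparison follow directly from the published figures. A quick check in R, using only the numbers on the slide (the variable names are illustrative):

```r
# Hardware and timing figures from the logistic regression comparison
sas  <- c(nodes = 32, cores = 384, ram_gb = 1536, seconds = 80)
revo <- c(nodes = 5,  cores = 20,  ram_gb = 80,   seconds = 44)

# How many times more of each resource the SAS run used
ratios <- sas / revo
print(round(ratios, 1))
```

Cores and RAM both come out at a factor of 19.2 (roughly "a 20th"), nodes at 6.4 (roughly "a 6th"), and elapsed time at about 1.8.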
Specific speed-related factors
• Efficient computational algorithms
• Efficient memory management: minimize data copying and data conversion
• Heavy use of C++ templates; optimal code
• Efficient data file format; fast access by row and column
• Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
• Handle categorical variables efficiently
Revolution R Enterprise 21
Scalability and portability of Revolution Analytics
“Parallel External Memory Algorithms” (PEMAs)
• Anatomy of a PEMA: 1) Initialize, 2) Process Chunk, 3) Aggregate, 4) Finalize
• Process a chunk of data at a time, giving linear scalability
• Process an unlimited number of rows of data in a fixed amount of RAM
• Independent of the “compute context” (number of cores, computers, distributed computing platform), giving portability across these dimensions
• Independent of where the data is coming from, giving portability with respect to data sources
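The four PEMA steps can be illustrated in base R. This is a minimal sketch for a mean and variance, not RevoScaleR's internal code; all function names (pema_init, pema_process_chunk, pema_aggregate, pema_finalize) are hypothetical.

```r
# 1) Initialize: empty running totals
pema_init <- function() list(n = 0, sum = 0, sumsq = 0)

# 2) Process chunk: summarize one chunk independently of all others
pema_process_chunk <- function(x) {
  list(n = length(x), sum = sum(x), sumsq = sum(x^2))
}

# 3) Aggregate: combine two intermediate results
pema_aggregate <- function(a, b) {
  list(n = a$n + b$n, sum = a$sum + b$sum, sumsq = a$sumsq + b$sumsq)
}

# 4) Finalize: turn the totals into the statistics of interest
pema_finalize <- function(s) {
  m <- s$sum / s$n
  list(n = s$n, mean = m, var = (s$sumsq - s$n * m^2) / (s$n - 1))
}

set.seed(1)
x <- rnorm(10000, mean = 5, sd = 2)
chunks <- split(x, ceiling(seq_along(x) / 1000))  # ten chunks of 1,000 rows

partials <- lapply(chunks, pema_process_chunk)    # each chunk could run on any core or node
totals   <- Reduce(pema_aggregate, partials, pema_init())
result   <- pema_finalize(totals)
```

Only one chunk plus three running totals need to be in memory at a time, which is why the row count is not limited by RAM. The sum-of-squares formula is used here for brevity; a production implementation would likely use a numerically more stable update.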
Simplified ScaleR Internal Architecture
Analytics Engine: PEMAs are implemented here (Scalable, Parallelized, Threaded, Distributable)
Inter-process Communication: MPI, RPC, Sockets, Files, UDFs
Data Sources: HDFS, Teradata, ODBC, SAS, SPSS, CSV, Fixed, XDF
DistributedR, ScaleR, ConnectR, DeployR
DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE
In the Cloud: Amazon AWS
Workstations & Servers: Windows, Red Hat and SUSE Linux
Clustered Systems: IBM Platform LSF, Microsoft HPC
EDW: Teradata, IBM PureData™ for Analytics
Hadoop: Cloudera, Hortonworks
Write Once. Deploy Anywhere.
Decision Trees
• Easy-to-interpret models
• Widely used in a variety of disciplines, for example:
  – Predicting which patient characteristics are associated with a high risk of an event such as a heart attack
  – Deciding whether to offer a loan to an individual based on individual characteristics
  – Predicting the rate of return of various investment strategies
  – Retail target marketing
• Can handle a multi-level factor response easily
• Useful in identifying important interactions
Decision Tree Types
• Classification tree: predict what ‘class’ or ‘group’ an observation belongs to (dependent variable is a factor)
• Regression tree: predict the value of a continuous dependent variable
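The two tree types can be tried on small built-in data sets. The sketch below uses the open source rpart package (shipped with standard R distributions) rather than rxDTree, so that it runs without Revolution R Enterprise; the data sets and formulas are illustrative choices, not from the webinar.

```r
library(rpart)

# Classification tree: the response (Species) is a factor
class_tree <- rpart(Species ~ ., data = iris, method = "class")

# Regression tree: the response (mpg) is continuous
reg_tree <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova")

# Predictions are class labels in the first case and numbers in the second
class_pred <- predict(class_tree, iris, type = "class")
reg_pred   <- predict(reg_tree, mtcars)
```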
Polling Question:
• What is your “go-to” tree algorithm for predictions in your work? (Please select one answer)
  – Single Trees
  – Random Forests
  – It Depends
  – Neither
Classification Example: Marketing Response
Data set containing the following information:
• Response: which channel the customer responded to (phone call, email, or mailing)
• Age
• Income
• Marital status
• Attended college?
Estimating the model
# Fit a classification tree for the response channel
treeOut <- rxDTree(response ~ age + income + college + marital,
                   data = rdata)
Simple Example: Text Output
• Each line gives information on the split, the number of observations in the node, the “loss”, the predicted value, and the probabilities
 1) root 10000 4069 Email (0.33260000 0.59310000 0.07430000)
   2) college=No College 5074 2378 Phone (0.53133622 0.38943634 0.07922743)
     4) age>=39.5 2518 330 Phone (0.86894361 0.00000000 0.13105639)
       8) age< 64.5 2256 77 Phone (0.96586879 0.00000000 0.03413121) *
       9) age>=64.5 262 9 Mail (0.03435115 0.00000000 0.96564885) *
     5) age< 39.5 2556 580 Email (0.19874804 0.77308294 0.02816901)
      10) marital=Single 835 371 Phone (0.55568862 0.40958084 0.03473054)
        20) income>=29.5 472 14 Phone (0.97033898 0.00000000 0.02966102) *
        21) income< 29.5 363 21 Email (0.01652893 0.94214876 0.04132231) *
      11) marital=Married 1721 87 Email (0.02556653 0.94944800 0.02498547) *
   3) college=College 4926 971 Email (0.12789281 0.80288266 0.06922452) …
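To make the columns concrete: the probabilities appear to be listed in the order Phone, Email, Mail (inferred from which entry is largest in each node), and the “loss” is the number of observations in the node that fall outside the predicted class. A short check in R with figures copied from four of the nodes above (the data frame and its column names are illustrative):

```r
# Node number, observation count, loss, and probability of the predicted class
nodes <- data.frame(
  node   = c(1, 2, 3, 8),
  n      = c(10000, 5074, 4926, 2256),
  loss   = c(4069, 2378, 971, 77),
  p_pred = c(0.59310000, 0.53133622, 0.80288266, 0.96586879)
)

# Loss implied by the probabilities: observations not in the predicted class
nodes$implied_loss <- round(nodes$n * (1 - nodes$p_pred))

# Misclassification rate within each node
nodes$error_rate <- nodes$loss / nodes$n
print(nodes)
```

The implied loss matches the printed loss for each of these nodes, and the two children of the root (nodes 2 and 3) account for all 10,000 observations.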
Interactive HTML Graphics
The ‘Big Data’ Decision Tree Algorithm
• Classical algorithms for building a decision tree sort all continuous variables in order to decide where to split the data.
• This sorting step becomes prohibitive when dealing with large data.
• rxDTree bins the data rather than sorting, computing histograms to create empirical distribution functions of the data.
• rxDTree partitions the data “horizontally”, processing different subsets of the observations in parallel.
• The accuracy of the parallel tree approximately equals that of the serial tree (Ben-Haim & Tom-Tov, 2010).
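The binning idea can be sketched in base R for a single split of one numeric predictor in a regression setting. This is a simplification under stated assumptions: it uses fixed, equal-width bins with a known data range, whereas the cited algorithm builds streaming histograms, and the function names (bin_chunk, best_split) are hypothetical rather than RevoScaleR functions.

```r
# Per-chunk pass: for each bin, count the rows and total the response
bin_chunk <- function(x, y, breaks) {
  nb  <- length(breaks) - 1
  bin <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  f   <- factor(bin, levels = seq_len(nb))
  sum_y <- as.vector(tapply(y, f, sum))
  sum_y[is.na(sum_y)] <- 0                      # empty bins contribute nothing
  list(n = tabulate(bin, nbins = nb), sum_y = sum_y)
}

# Choose the bin boundary that minimizes the squared error of a two-group fit
best_split <- function(h, breaks) {
  nb      <- length(h$n)
  n_left  <- cumsum(h$n)[-nb]
  s_left  <- cumsum(h$sum_y)[-nb]
  n_right <- sum(h$n) - n_left
  s_right <- sum(h$sum_y) - s_left
  score   <- s_left^2 / n_left + s_right^2 / n_right  # larger means lower error
  score[n_left == 0 | n_right == 0] <- -Inf
  breaks[which.max(score) + 1]
}

set.seed(42)
x <- runif(20000, 0, 100)
y <- ifelse(x < 40, 10, 20) + rnorm(20000)      # true breakpoint at x = 40
breaks <- seq(0, 100, length.out = 101)          # 100 equal-width bins

# Horizontal partitioning: each chunk is binned independently, then merged
chunk_id <- rep(1:4, length.out = length(x))
hists <- lapply(1:4, function(i) {
  bin_chunk(x[chunk_id == i], y[chunk_id == i], breaks)
})
merged <- Reduce(function(a, b) list(n = a$n + b$n, sum_y = a$sum_y + b$sum_y), hists)
split_at <- best_split(merged, breaks)
```

No sort of x is needed: each chunk is scanned once, and only the per-bin counts and sums travel between workers. Candidate splits are limited to bin boundaries, which is the source of the approximation relative to a sort-based tree.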
Useful rxDTree Arguments for Big Data
• maxDepth: maximum tree depth
• minBucket: minimum number of observations in a terminal node
• minSplit: minimum number of observations needed to split
• cp: minimum fit improvement needed to accept a split
• maxNumBins: maximum number of bins used to bin numeric data
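Four of these arguments have close counterparts in rpart.control from the open source rpart package, which makes their effect easy to try without Revolution R Enterprise. The sketch below is an rpart analogue, not rxDTree itself; the argument values and the iris data are arbitrary illustrations, and maxNumBins has no rpart equivalent because rpart sorts rather than bins.

```r
library(rpart)

# rpart.control counterparts of the rxDTree arguments:
#   maxdepth ~ maxDepth, minbucket ~ minBucket, minsplit ~ minSplit, cp ~ cp
ctrl <- rpart.control(maxdepth = 3, minbucket = 10, minsplit = 30, cp = 0.01)
small_tree <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)

# rpart numbers nodes so that node k has children 2k and 2k + 1 (root = 1),
# so the depth of a node is floor(log2(node number))
node_ids   <- as.numeric(rownames(small_tree$frame))
depth      <- floor(log2(node_ids))
leaf_sizes <- small_tree$frame$n[small_tree$frame$var == "<leaf>"]

# A corresponding rxDTree call would pass the slide's argument names, e.g.
#   rxDTree(response ~ age + income, data = rdata,
#           maxDepth = 3, minBucket = 10, minSplit = 30, cp = 0.01)
```

On big data, tighter limits on depth and node size reduce the number of passes over the data as well as the risk of overfitting.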
Related Decision Tree Functions
• prune.rxDTree: model simplification
• rxDTreeBestCp: optimizes pruning
• rxAddInheritance: makes rxDTree compatible with rpart for printing and plotting
• createTreeView: interactive HTML graphics
• rxPredict.rxDTree: scores new data using rxDTree model
• rxDForest: ensembles of decision trees; big data alternative to randomForest package
• rxVarImpPlot: plots variable importance as measured by rxDForest
• rxPredict.rxDForest: scores new data using rxDForest model
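The grow-then-prune workflow behind the first two functions can be sketched with rpart, which the list above names as the compatible open source counterpart. This is an rpart analogue under stated assumptions: the selection rule shown (lowest cross-validated error) is one common choice, and rxDTreeBestCp may apply a different rule.

```r
library(rpart)

set.seed(7)
# Grow a deliberately large tree (cp = 0) with 10-fold cross-validation
big_tree <- rpart(Species ~ ., data = iris, method = "class",
                  control = rpart.control(cp = 0, minsplit = 5, xval = 10))

# Pick the complexity parameter with the lowest cross-validated error
cp_table <- big_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune the large tree back to that complexity
pruned <- prune(big_tree, cp = best_cp)
```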
Sample code for Decision Trees on workstation
# Specify local data source
airData <- myLocalDataSource
# Specify model formula and parameters
rxDTree(ArrDelay ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + CRSDepTime,
        data = airData)
Sample code for Decision Trees on Hadoop
# Change the “compute context”
rxSetComputeContext(myHadoopCluster)
# Change the data source if necessary
airData <- myHadoopDataSource
# Otherwise, the code is the same
rxDTree(ArrDelay ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + CRSDepTime,
        data = airData)
Write Once, Deploy Anywhere
# Set the desired compute context for data analytics execution
rxSetComputeContext("local")  # DEFAULT: local system
rxSetComputeContext(RxHadoopMR(<data, server environment arguments>))
rxSetComputeContext(RxInTeradata(<data, server environment arguments>))
rxSetComputeContext(RxHpcServer(<data, server environment arguments>))
rxSetComputeContext(RxLsfCluster(<data, server environment arguments>))
# The same code then runs in any of these contexts
# Summarize and calculate descriptive statistics
adsSummary <- rxSummary(~ ArrDelay + CRSDepTime + DayOfWeek, data = airDS)
# Fit Linear Model
arrDelayLm1 <- rxLinMod(ArrDelay ~ DayOfWeek, data = airDS)
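The idea of separating the analysis from where it runs can be imitated in base R. In this sketch the analysis is written once against a generic apply function, and a "context" decides how the chunks are executed. It is a conceptual illustration only: make_context and chunked_mean are hypothetical names, and the multicore branch uses forking from the parallel package (falling back to sequential execution on Windows), not any RevoScaleR mechanism.

```r
library(parallel)

# A context is just a function that applies FUN to every chunk in X
make_context <- function(type = c("local", "multicore"), cores = 2) {
  type <- match.arg(type)
  if (type == "multicore" && .Platform$OS.type != "windows") {
    function(X, FUN) mclapply(X, FUN, mc.cores = cores)
  } else {
    function(X, FUN) lapply(X, FUN)    # sequential execution
  }
}

# The analysis, written once: a mean computed chunk by chunk
chunked_mean <- function(chunks, apply_fun) {
  parts  <- apply_fun(chunks, function(ch) c(n = length(ch), s = sum(ch)))
  totals <- Reduce(`+`, parts)
  unname(totals["s"] / totals["n"])
}

set.seed(3)
delays <- split(rexp(8000, rate = 1 / 10), rep(1:8, each = 1000))

# Only the context changes between the two runs
local_result     <- chunked_mean(delays, make_context("local"))
multicore_result <- chunked_mean(delays, make_context("multicore"))
```

Both runs return the same value, because the per-chunk work and the aggregation step do not depend on how the chunks are scheduled.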
Polling Question:
• Which platforms are you most interested in using to run tree models on your data? (Please select all that apply)
  – Server
  – Grid
  – Hadoop
  – Teradata
  – Other
Revolution R Enterprise ScaleR: High Performance Big Data Analytics
Data Prep, Distillation & Descriptive Analytics
R Data Step
• Data import: Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation using any R functions and packages
• Recode variables
• Factor variables
• Missing value handling
• Sort
• Merge
• Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max
• Mean
• Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
Revolution R Enterprise ScaleR (continued)
Statistical Modeling
Predictive Models
• Covariance/Correlation/Sum of Squares/Cross-product Matrix
• Multiple Linear Regression
• Logistic Regression
• Generalized Linear Models (GLM)
  – All exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions including: cauchit, identity, log, logit, probit.
  – User defined distributions & link functions.
• Classification & Regression Trees and Forests
• Residuals for all models
Data Visualization
• Histogram
• ROC Curves (actual data and predicted values)
• Lorenz Curve
• Line and Scatter Plots
• NEW Tree Visualization
Variable Selection
• Stepwise Regression
  – Linear
  – NEW logistic
  – NEW GLM
Simulation and HPC
• Monte Carlo
• Run open source R functions and packages across cores and nodes
Machine Learning
Cluster Analysis
• K-Means
Classification
• Decision Trees
• NEW Decision Forests
Deployment
• Prediction (scoring)
• NEW PMML Export
Resources For You
• Big Data Decision Trees with R
  – http://www.revolutionanalytics.com/whitepaper/big-data-decision-trees-r
• Advanced, Big Data Analytics with R and Hadoop
  – http://www.revolutionanalytics.com/whitepaper/advanced-big-data-analytics-r-and-hadoop
• Revolution R Enterprise: Faster than SAS
  – http://www.revolutionanalytics.com/whitepaper/revolution-r-enterprise-faster-sas
• May 13, 2014: Webinar presenting the results of our RRE vs. SAS benchmarking
• To get RRE for yourself, please visit:
  – http://www.revolutionanalytics.com/get-revolution-r-enterprise
Thank you
Revolution Analytics is the leading commercial provider of software and support for the popular open source R statistics language.
mario.inchiosa@revolutionanalytics.com
www.revolutionanalytics.com, 1.855.GET.REVO, Twitter: @RevolutionR

More Related Content

What's hot

Big Data: An Overview
Big Data: An OverviewBig Data: An Overview
Big Data: An OverviewC. Scyphers
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsKamalika Dutta
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataJoey Li
 
Hadoop,Big Data Analytics and More
Hadoop,Big Data Analytics and MoreHadoop,Big Data Analytics and More
Hadoop,Big Data Analytics and MoreTrendwise Analytics
 
Rob peglar introduction_analytics _big data_hadoop
Rob peglar introduction_analytics _big data_hadoopRob peglar introduction_analytics _big data_hadoop
Rob peglar introduction_analytics _big data_hadoopGhassan Al-Yafie
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An OverviewArvind Kalyan
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in detailsMahmoud Yassin
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big DataFrank Kienle
 
introduction to big data frameworks
introduction to big data frameworksintroduction to big data frameworks
introduction to big data frameworksAmal Targhi
 
The Evolution of Big Data Frameworks
The Evolution of Big Data FrameworksThe Evolution of Big Data Frameworks
The Evolution of Big Data FrameworkseXascale Infolab
 

What's hot (20)

Big Data: An Overview
Big Data: An OverviewBig Data: An Overview
Big Data: An Overview
 
Are you ready for BIG DATA?
Are you ready for BIG DATA?Are you ready for BIG DATA?
Are you ready for BIG DATA?
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time Systems
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Hadoop,Big Data Analytics and More
Hadoop,Big Data Analytics and MoreHadoop,Big Data Analytics and More
Hadoop,Big Data Analytics and More
 
Rob peglar introduction_analytics _big data_hadoop
Rob peglar introduction_analytics _big data_hadoopRob peglar introduction_analytics _big data_hadoop
Rob peglar introduction_analytics _big data_hadoop
 
Bigdata
Bigdata Bigdata
Bigdata
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An Overview
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in details
 
BDaas- BigData as a service
BDaas- BigData as a service  BDaas- BigData as a service
BDaas- BigData as a service
 
Big Data: an introduction
Big Data: an introductionBig Data: an introduction
Big Data: an introduction
 
A data analyst view of Bigdata
A data analyst view of Bigdata A data analyst view of Bigdata
A data analyst view of Bigdata
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big Data
 
Big data frameworks
Big data frameworksBig data frameworks
Big data frameworks
 
Big data with java
Big data with javaBig data with java
Big data with java
 
introduction to big data frameworks
introduction to big data frameworksintroduction to big data frameworks
introduction to big data frameworks
 
The Evolution of Big Data Frameworks
The Evolution of Big Data FrameworksThe Evolution of Big Data Frameworks
The Evolution of Big Data Frameworks
 

Viewers also liked

Learning Linear Models with Hadoop
Learning Linear Models with HadoopLearning Linear Models with Hadoop
Learning Linear Models with HadoopDataWorks Summit
 
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...Romeo Kienzler
 
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...Revolution Analytics
 
The Workflow Abstraction
The Workflow AbstractionThe Workflow Abstraction
The Workflow AbstractionPaco Nathan
 
On Implementation of Neuron Network(Back-propagation)
On Implementation of Neuron Network(Back-propagation)On Implementation of Neuron Network(Back-propagation)
On Implementation of Neuron Network(Back-propagation)Yu Liu
 
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15Revolution Analytics
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Lecture 05 - The Data Warehouse and Technology
Lecture 05 - The Data Warehouse and TechnologyLecture 05 - The Data Warehouse and Technology
Lecture 05 - The Data Warehouse and Technologyphanleson
 
eHarmony @ Hbase Conference 2016 by vijay vangapandu.
eHarmony @ Hbase Conference 2016 by vijay vangapandu.eHarmony @ Hbase Conference 2016 by vijay vangapandu.
eHarmony @ Hbase Conference 2016 by vijay vangapandu.Vijaykumar Vangapandu
 
Apache ranger meetup
Apache ranger meetupApache ranger meetup
Apache ranger meetupnvvrajesh
 
Data Mining. Classification
Data Mining. ClassificationData Mining. Classification
Data Mining. ClassificationSSA KPI
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoopnvvrajesh
 
Big Data & Analytics MapReduce/Hadoop – A programmer’s perspective
Big Data & Analytics MapReduce/Hadoop – A programmer’s perspectiveBig Data & Analytics MapReduce/Hadoop – A programmer’s perspective
Big Data & Analytics MapReduce/Hadoop – A programmer’s perspectiveEMC
 
Predicting Consumer Behaviour via Hadoop
Predicting Consumer Behaviour via HadoopPredicting Consumer Behaviour via Hadoop
Predicting Consumer Behaviour via HadoopSkillspeed
 
mapReduce for machine learning
mapReduce for machine learning mapReduce for machine learning
mapReduce for machine learning Pranya Prabhakar
 
Real-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionReal-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionRevolution Analytics
 
Daniel Abadi HadoopWorld 2010
Daniel Abadi HadoopWorld 2010Daniel Abadi HadoopWorld 2010
Daniel Abadi HadoopWorld 2010Daniel Abadi
 
Design patterns in MapReduce
Design patterns in MapReduceDesign patterns in MapReduce
Design patterns in MapReduceAkhilesh Joshi
 

Viewers also liked (20)

Learning Linear Models with Hadoop
Learning Linear Models with HadoopLearning Linear Models with Hadoop
Learning Linear Models with Hadoop
 
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
 
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
 
The Workflow Abstraction
The Workflow AbstractionThe Workflow Abstraction
The Workflow Abstraction
 
Weka pentaho day2014-fidelis
Weka pentaho day2014-fidelisWeka pentaho day2014-fidelis
Weka pentaho day2014-fidelis
 
On Implementation of Neuron Network(Back-propagation)
On Implementation of Neuron Network(Back-propagation)On Implementation of Neuron Network(Back-propagation)
On Implementation of Neuron Network(Back-propagation)
 
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Lecture 05 - The Data Warehouse and Technology
Lecture 05 - The Data Warehouse and TechnologyLecture 05 - The Data Warehouse and Technology
Lecture 05 - The Data Warehouse and Technology
 
eHarmony @ Hbase Conference 2016 by vijay vangapandu.
eHarmony @ Hbase Conference 2016 by vijay vangapandu.eHarmony @ Hbase Conference 2016 by vijay vangapandu.
eHarmony @ Hbase Conference 2016 by vijay vangapandu.
 
Apache ranger meetup
Apache ranger meetupApache ranger meetup
Apache ranger meetup
 
Data Mining. Classification
Data Mining. ClassificationData Mining. Classification
Data Mining. Classification
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Big Data & Analytics MapReduce/Hadoop – A programmer’s perspective
Big Data & Analytics MapReduce/Hadoop – A programmer’s perspectiveBig Data & Analytics MapReduce/Hadoop – A programmer’s perspective
Big Data & Analytics MapReduce/Hadoop – A programmer’s perspective
 
Predicting Consumer Behaviour via Hadoop
Predicting Consumer Behaviour via HadoopPredicting Consumer Behaviour via Hadoop
Predicting Consumer Behaviour via Hadoop
 
mapReduce for machine learning
mapReduce for machine learning mapReduce for machine learning
mapReduce for machine learning
 
Real-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionReal-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to Production
 
Daniel Abadi HadoopWorld 2010
Daniel Abadi HadoopWorld 2010Daniel Abadi HadoopWorld 2010
Daniel Abadi HadoopWorld 2010
 
Design patterns in MapReduce
Design patterns in MapReduceDesign patterns in MapReduce
Design patterns in MapReduce
 

Similar to Decision trees in hadoop

High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopRevolution Analytics
 
What's New in Revolution R Enterprise 6.2
What's New in Revolution R Enterprise 6.2What's New in Revolution R Enterprise 6.2
What's New in Revolution R Enterprise 6.2Revolution Analytics
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopDataWorks Summit
 
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Revolution Analytics
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopRevolution Analytics
 
Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution Analytics
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Revolution Analytics
 
In-Database Analytics Deep Dive with Teradata and Revolution

Decision trees in hadoop

  • 1. Decision Trees built in Hadoop plus more Big Data Analytics with Revolution R Enterprise
      Mario Inchiosa, US Chief Scientist, mario.inchiosa@revolutionanalytics.com
      Revolution Webinar, April 17, 2014
      All Rights Reserved, Revolution Analytics 2014
  • 2. OUR COMPANY
      - The leading provider of advanced analytics software and services based on open source R, since 2007
    OUR SOFTWARE
      - The only Big Data, Big Analytics software platform based on the data science language R
    SOME KUDOS
      - Visionary, Gartner Magic Quadrant for Advanced Analytics Platforms, 2014
  • 3. Typical Challenges our Customers Face
    Big Data
      - Many new data sources
      - Data variety & velocity
      - Data movement, memory limits
    Production Efficiency
      - Shorter model shelf life
      - Volume of models
      - Long end-to-end cycle time
      - Pace of decision accelerated
    Enterprise Readiness
      - Heterogeneous landscape
      - Write once, deploy anywhere
      - Skill shortage
      - Production support
    Complex Computation
      - Mathematically sophisticated
      - Parallelization
      - Experimentation
      - Ensemble models
      - Many small models
      - Simulation
  • 4. Polling Question: What is your current analytics software platform? (Please select one)
      - R/RRE
      - SAS
      - SPSS
      - Tibco/Spotfire
      - KXEN
      - Other
  • 6. What is R?
      - Most widely used data analysis software: used by 2M+ data scientists, statisticians and analysts
      - Most powerful statistical programming language: flexible, extensible and comprehensive for productivity
      - Create beautiful and unique data visualizations: as seen in New York Times, Twitter and Flowing Data
      - Thriving open-source community: leading edge of analytics research
      - Fills the talent gap: new graduates prefer R
    White paper "R is Hot": bit.ly/r-is-hot
  • 7. Exploding growth and demand for R
      - R is the highest paid IT skill (Dice.com, Jan 2014)
      - R is the most-used data science language after SQL (O’Reilly, Jan 2014)
      - R is used by 70% of data miners (Rexer, Sep 2013)
      - R is #15 of all programming languages (RedMonk, Jan 2014)
      - R is growing faster than any other data science language (KDnuggets, Aug 2013)
      - More than 2 million users worldwide
    [Chart: R Usage Growth, Rexer Data Miner Survey, 2007-2013. 70% of data miners report using R. R is the first choice of more data miners than any other software. Source: www.rexeranalytics.com]
  • 8. REVOLUTION R ENTERPRISE THE BIG DATA BIG ANALYTICS PLATFORM
  • 9. Revolution R Enterprise is the only big data big analytics platform based on open source R
      - High Performance, Scalable Analytics
      - Portable Across Enterprise Platforms
      - Easier to Build & Deploy Analytics
  • 10. R is open source and drives analytic innovation, but has some limitations for enterprises
    Each row: open source R; Revolution R Enterprise; result
      - Big Data: In-memory bound; Hybrid memory & disk scalability; Operates on bigger volumes & factors
      - Speed of Analysis: Single threaded; Parallel threading; Shrinks analysis time
      - Enterprise Readiness: Community support; Commercial support; Delivers full service production support
      - Analytic Breadth & Depth: 5000+ innovative analytic packages; Leverage open source packages plus Big Data ready packages; Supercharges R
  • 11. Revolution R Enterprise is the Big Data Big Analytics Platform
    All of Open Source R plus:
      - Big Data scalability
      - High-performance analytics
      - Development and deployment tools
      - Data source connectivity
      - Application integration framework
      - Multi-platform architecture
      - Support, Training and Services
  • 12. The Platform Step by Step: R Capabilities
    R+CRAN
      - Open source R interpreter
      - UPDATED: R 3.0.2
      - Freely-available R algorithms
      - Algorithms callable by RevoR
      - Embeddable in R scripts
      - 100% compatible with existing R scripts, functions and packages
    RevoR
      - Performance enhanced R interpreter
      - Based on open source R
      - Adds high-performance math
    Available on
      - IBM® Platform LSF™ Clusters
      - Microsoft® HPC Clusters
      - Windows® & Linux Servers
      - Windows® & Linux Workstations
      - NEW: Cloudera® Hadoop
      - NEW: Hortonworks® Hadoop
      - NEW: Teradata® Database
      - Intel® Hadoop
      - IBM® BigInsights™
      - IBM® PureData™ for Analytics, powered by Netezza technology
  • 13. Revolution R Enterprise RevoR: Performance Enhanced R
    Computation (4-core laptop): Open Source R; Revolution R; Speedup
    Linear Algebra [1]
      - Matrix Multiply: 176 sec; 9.3 sec; 18x
      - Cholesky Factorization: 25.5 sec; 1.3 sec; 19x
      - Linear Discriminant Analysis: 189 sec; 74 sec; 3x
    General R Benchmarks [2]
      - R Benchmarks (Matrix Functions): 22 sec; 3.5 sec; 5x
      - R Benchmarks (Program Control): 5.6 sec; 5.4 sec; not appreciable
    Customers report 3-50x performance improvements compared to Open Source R, without changing any code
    [1] http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php
    [2] http://r.research.att.com/benchmarks/
  • 14. The Platform Step by Step: Parallelization & Data Sourcing
    DistributedR
      - Distributed computing framework
      - Delivers portability across platforms
    ConnectR
      - High-speed data import/export
      Available for:
      - High-performance XDF
      - SAS, SPSS, delimited & fixed format text data files
      - Hadoop HDFS (text & XDF)
      - Teradata Database TPT
      - ODBC (incl. Vertica, Oracle, Pivotal, Aster, SybaseIQ, DB2, MySQL)
    ScaleR
      - Ready-to-use high-performance big data big analytics
      - Fully-parallelized analytics
      - Data prep & data distillation
      - Descriptive statistics & statistical tests
      - Correlation & covariance matrices
      - Predictive models: linear, logistic, GLM
      - Machine learning
      - Monte Carlo simulation
      - NEW: Tools for distributing customized algorithms across nodes
    DistributedR available on:
      - Windows Servers
      - Red Hat and NEW SuSE Linux Servers
      - IBM Platform LSF Linux Clusters
      - Microsoft HPC Clusters
      - NEW: Cloudera Hadoop
      - NEW: Hortonworks Hadoop
      - NEW: Teradata Database
  • 15. The Platform Step by Step: Tools & Deployment
    DeployR
      - Web services software development kit for integrating analytics via Java, JavaScript or .NET APIs
      - Integrates R into application infrastructures
      Capabilities:
      - Invokes R scripts from web services calls
      - RESTful interface for easy integration
      - Works with web & mobile apps, leading BI & visualization tools and business rules engines
    DevelopR
      - Integrated development environment for R
      - Visual ‘step-into’ debugger
      Available on:
      - Windows
  • 16. Scalable and Parallelized across Cores and Nodes
    [Diagram: BIG DATA is divided into DATA PARTITIONs across COMPUTE NODEs; each node is a MULTICORE PROCESSOR (4, 8, 16+ cores) with SHARED MEMORY and one thread per core; steps labeled "Evaluate" and "Combine Intermediate Results"; a MASTER NODE]
  • 17. ScaleR Scalability and Performance
      - Handles an arbitrarily large number of rows in a fixed amount of memory
      - Scales linearly with the number of rows
      - Scales linearly with the number of nodes
      - Scales well with the number of cores per node
      - Scales well with the number of parameters
      - Extremely high performance
  • 18. Eliminates Performance and Capacity Limits of Open Source R and Legacy SAS
      - Unique PEMAs: Parallel, external-memory algorithms
      - High-performance, scalable replacements for R/SAS analytic functions
      - Parallel/distributed processing eliminates CPU bottleneck
      - Data streaming eliminates memory size limitations
      - Works with in-memory and disk-based architectures
  • 19.
  • 20. Logistic Regression: Revolution R Enterprise ScaleR Outperforms SAS HPA at a Fraction of the Cost
    Each row: SAS HPA*; Revolution R
      - Rows of data: 1 billion; 1 billion
      - Parameters: “just a few”; 7
      - Time: 80 seconds; 44 seconds
      - Data location: In memory; On disk
      - Nodes: 32; 5
      - Cores: 384; 20
      - RAM: 1,536 GB; 80 GB
    Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
    *As published by SAS in HPCwire, April 21, 2011. See Revolution white paper for additional benchmarks.
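The "20th" and "6th" figures on slide 20 follow from simple ratios of the two columns. A small base-R check, assuming the first value in each pair is the SAS HPA run and the second is the Revolution R run (the variable names are illustrative only):

```r
# Resource figures quoted on the slide; which column is which is an assumption
# inferred from the slide's own summary sentence.
sas_hpa      <- c(nodes = 32, cores = 384, ram_gb = 1536, seconds = 80)
revolution_r <- c(nodes = 5,  cores = 20,  ram_gb = 80,   seconds = 44)

# How many times more of each resource (and time) the SAS HPA run used.
ratio <- round(sas_hpa / revolution_r, 1)
print(ratio)
# nodes: 6.4, cores: 19.2, ram_gb: 19.2, seconds: 1.8
```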
  • 21. Specific speed-related factors
      - Efficient computational algorithms
      - Efficient memory management: minimize data copying and data conversion
      - Heavy use of C++ templates; optimal code
      - Efficient data file format; fast access by row and column
      - Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
      - Handle categorical variables efficiently
  • 22. Scalability and portability of Revolution Analytics “Parallel External Memory Algorithms” (PEMAs)
      - Anatomy of a PEMA: 1) Initialize, 2) Process Chunk, 3) Aggregate, 4) Finalize
      - Process a chunk of data at a time, giving linear scalability
      - Process an unlimited number of rows of data in a fixed amount of RAM
      - Independent of the “compute context” (number of cores, computers, distributed computing platform), giving portability across these dimensions
      - Independent of where the data is coming from, giving portability with respect to data sources
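The four PEMA steps named on slide 22 can be illustrated with a minimal base-R sketch that computes a mean and variance one chunk at a time. This is not ScaleR's implementation: the function names (`pema_initialize`, `pema_process_chunk`, `pema_aggregate`, `pema_finalize`) are hypothetical, and slicing an in-memory vector stands in for reading chunks from disk.

```r
# Illustrative PEMA-style mean and variance. All names are hypothetical
# and are not part of ScaleR.

# 1) Initialize: an empty set of sufficient statistics.
pema_initialize <- function() list(n = 0, sum = 0, sumsq = 0)

# 2) Process chunk: intermediate results computed from one chunk only.
pema_process_chunk <- function(chunk) {
  list(n = length(chunk), sum = sum(chunk), sumsq = sum(chunk^2))
}

# 3) Aggregate: combine two sets of intermediate results.
pema_aggregate <- function(a, b) {
  list(n = a$n + b$n, sum = a$sum + b$sum, sumsq = a$sumsq + b$sumsq)
}

# 4) Finalize: turn the aggregated statistics into the final answer.
pema_finalize <- function(s) {
  m <- s$sum / s$n
  list(mean = m, var = (s$sumsq - s$n * m^2) / (s$n - 1))
}

set.seed(1)
x <- rnorm(10000)
# Stand-in for chunks streamed from a file: ten chunks of 1000 rows each.
chunks <- split(x, ceiling(seq_along(x) / 1000))

state <- pema_initialize()
for (chunk in chunks) {
  state <- pema_aggregate(state, pema_process_chunk(chunk))
}
result <- pema_finalize(state)
```

Only one chunk is held at a time in step 2, and step 3 is a plain sum, so the chunks could be processed on different cores or nodes in any order before aggregation.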
  • 23. Simplified ScaleR Internal Architecture
      - Analytics Engine: PEMAs are implemented here (Scalable, Parallelized, Threaded, Distributable)
      - Inter-process Communication: MPI, RPC, Sockets, Files, UDFs
      - Data Sources: HDFS, Teradata, ODBC, SAS, SPSS, CSV, Fixed, XDF
  • 24. Write Once. Deploy Anywhere.
    DistributedR, ScaleR, ConnectR, DeployR: DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE
      - In the Cloud: Amazon AWS
      - Workstations & Servers: Windows, Red Hat and SUSE Linux
      - Clustered Systems: IBM Platform LSF, Microsoft HPC
      - EDW: Teradata, IBM PureData™ for Analytics
      - Hadoop: Cloudera, Hortonworks
  • 25. Decision Trees
      - Easy-to-interpret models
      - Widely used in a variety of disciplines, for example:
          - Predicting which patient characteristics are associated with high risk of, for example, heart attack
          - Deciding whether or not to offer a loan to an individual based on individual characteristics
          - Predicting the rate of return of various investment strategies
          - Retail target marketing
      - Can handle multi-level factor response easily
      - Useful in identifying important interactions
  • 26. Decision Tree Types
      - Classification tree: predict what ‘class’ or ‘group’ an observation belongs to (dependent variable is a factor)
      - Regression tree: predict the value of a continuous dependent variable
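As a small illustration of the difference between the two tree types, the sketch below shows what a single terminal node would predict under the usual CART convention: the most frequent class for a factor response, and the mean for a continuous response. The helper names and the toy data are made up for this example.

```r
# Hypothetical helpers showing what one terminal node predicts.

# Classification tree: the most frequent class among the node's observations.
leaf_class <- function(y) names(which.max(table(y)))

# Regression tree: the mean response among the node's observations.
leaf_value <- function(y) mean(y)

response  <- factor(c("Email", "Phone", "Email", "Mail", "Email"))
arr_delay <- c(5, 12, -3, 20)

predicted_class <- leaf_class(response)    # "Email"
predicted_value <- leaf_value(arr_delay)   # 8.5
```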
  • 27. Polling Question: What is your “go-to” tree algorithm for predictions in your work? (Please select one answer)
      - Single Trees
      - Random Forests
      - It Depends
      - Neither
  • 28. Classification Example: Marketing Response
    Data set containing the following information:
      - Response: Was response to a phone call, email, or mailing?
      - Age
      - Income
      - Marital status
      - Attended college?
  • 29. Estimating the model
      treeOut <- rxDTree(response ~ age + income + college + marital, data = rdata)
  • 30. Simple Example: Text Output
    Each line gives information on the split, the number of observations in the node, the “loss”, the predicted value, and the probabilities:
       1) root 10000 4069 Email (0.33260000 0.59310000 0.07430000)
         2) college=No College 5074 2378 Phone (0.53133622 0.38943634 0.07922743)
           4) age>=39.5 2518 330 Phone (0.86894361 0.00000000 0.13105639)
             8) age< 64.5 2256 77 Phone (0.96586879 0.00000000 0.03413121) *
             9) age>=64.5 262 9 Mail (0.03435115 0.00000000 0.96564885) *
           5) age< 39.5 2556 580 Email (0.19874804 0.77308294 0.02816901)
            10) marital=Single 835 371 Phone (0.55568862 0.40958084 0.03473054)
              20) income>=29.5 472 14 Phone (0.97033898 0.00000000 0.02966102) *
              21) income< 29.5 363 21 Email (0.01652893 0.94214876 0.04132231) *
            11) marital=Married 1721 87 Email (0.02556653 0.94944800 0.02498547) *
         3) college=College 4926 971 Email (0.12789281 0.80288266 0.06922452)
       …
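The “loss” column in this output can be checked by hand: it is the number of observations in the node that do not belong to the predicted class, that is, n times one minus the largest class probability. A short base-R check on the first two nodes, assuming the probabilities are listed in the order Phone, Email, Mail (inferred from which class each node predicts):

```r
# Node sizes and class probabilities copied from nodes 1) and 2) of the output.
# The class order (Phone, Email, Mail) is an inference, not stated on the slide.
classes <- c("Phone", "Email", "Mail")
nodes <- list(
  root       = list(n = 10000, prob = c(0.33260000, 0.59310000, 0.07430000)),
  no_college = list(n = 5074,  prob = c(0.53133622, 0.38943634, 0.07922743))
)

# Predicted class: the most probable one. Loss: observations in other classes.
predicted <- vapply(nodes, function(nd) classes[which.max(nd$prob)], character(1))
loss      <- vapply(nodes, function(nd) round(nd$n * (1 - max(nd$prob))), numeric(1))
# predicted: root = "Email", no_college = "Phone"
# loss:      root = 4069,    no_college = 2378
```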
  • 32. The ‘Big Data’ Decision Tree Algorithm  Classical algorithms for building a decision tree sort all continuous variables in order to decide where to split the data.  This sorting step becomes prohibitively expensive when dealing with large data.  rxDTree bins the data rather than sorting, computing histograms to create empirical distribution functions of the data.  rxDTree partitions the data “horizontally”, processing different subsets of the observations in parallel.  The accuracy of the parallel tree approximately equals that of the serial tree (Ben-Haim & Tom-Tov, 2010).
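The binning idea can be sketched in a few lines of base R. This is a simplified illustration, not rxDTree's implementation: it assumes fixed equal-width bins, a single numeric predictor, a regression response, and made-up data. Each chunk of rows is reduced to per-bin counts and sums, the chunk histograms are added together, and only the bin boundaries are scored as candidate split points, so no global sort is needed.

```r
# Illustrative data (made up): the response jumps when x crosses 40
set.seed(1)
n <- 10000
x <- runif(n, min = 18, max = 80)
y <- ifelse(x >= 40, 10, 2) + rnorm(n)

# Fixed bin boundaries shared by every chunk (32 equal-width bins, an assumption)
breaks <- seq(18, 80, length.out = 33)
n_bins <- length(breaks) - 1

# Per-chunk summary: count and sum of the response in each bin
chunk_hist <- function(x, y) {
  bin <- cut(x, breaks, include.lowest = TRUE, labels = FALSE)
  list(
    n   = tabulate(bin, nbins = n_bins),
    sum = vapply(seq_len(n_bins), function(b) sum(y[bin == b]), numeric(1))
  )
}

# "Horizontal" partition: four row subsets, each summarised independently
chunks <- split(seq_len(n), rep(1:4, length.out = n))
hists  <- lapply(chunks, function(i) chunk_hist(x[i], y[i]))

# Merge step: histograms from different chunks simply add
total <- Reduce(function(a, b) list(n = a$n + b$n, sum = a$sum + b$sum), hists)

# Score each bin boundary as a split; a larger score means lower squared error
n_left    <- cumsum(total$n)
sum_left  <- cumsum(total$sum)
n_right   <- sum(total$n) - n_left
sum_right <- sum(total$sum) - sum_left
score <- sum_left^2 / n_left + sum_right^2 / n_right
score[n_left == 0 | n_right == 0] <- -Inf  # ignore splits with an empty side

split_point <- breaks[which.max(score) + 1]
cat("chosen split point:", split_point, "\n")
```

The `lapply` over chunks is the part that could run on separate workers, because each chunk needs only the shared bin boundaries. The chosen split can be off by up to one bin width, which is the accuracy trade-off the slide refers to.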
  • 33. Useful rxDTree Arguments for Big Data  maxDepth: maximum tree depth  minBucket: minimum number of observations in a terminal node  minSplit: minimum number of observations needed to split  cp: minimum fit improvement needed to accept a split  maxNumBins: maximum number of bins used to bin numeric data
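For readers without Revolution R Enterprise, the open-source rpart package exposes similarly named controls (`maxdepth`, `minbucket`, `minsplit`, `cp`), so their effect can be tried on a small data set. The sketch below uses rpart and its bundled `kyphosis` data as a stand-in. It is not an rxDTree call, the parameter values are arbitrary, and rpart has no counterpart to `maxNumBins` because it sorts rather than bins.

```r
library(rpart)  # open-source package; kyphosis is a small data set shipped with it

fit <- rpart(
  Kyphosis ~ Age + Number + Start,
  data    = kyphosis,
  method  = "class",
  control = rpart.control(
    maxdepth  = 3,    # maximum tree depth (root is depth 0)
    minbucket = 5,    # minimum observations in a terminal node
    minsplit  = 15,   # minimum observations needed to attempt a split
    cp        = 0.01  # minimum fit improvement needed to keep a split
  )
)

# Check the constraints on the fitted tree
leaves     <- fit$frame[fit$frame$var == "<leaf>", ]
node_depth <- floor(log2(as.numeric(rownames(fit$frame))))

print(fit)
```

Tightening `minbucket`, `minsplit`, or `cp` yields smaller trees, which is usually the first lever to pull when a tree on large data grows too deep to interpret.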
  • 34. Related Decision Tree Functions  prune.rxDTree – model simplification  rxDTreeBestCp – optimizes pruning  rxAddInheritance – makes rxDTree compatible with rpart for printing and plotting  createTreeView – interactive HTML graphics  rxPredict.rxDTree – scores new data using rxDTree model  rxDForest – Ensembles of Decision Trees – big data alternative to randomForest package  rxVarImpPlot – plots variable importance as measured by rxDForest  rxPredict.rxDForest – scores new data using rxDForest model
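The grow-then-prune workflow behind the pruning functions can be illustrated with rpart, which the slide names as the package rxDTree objects are made compatible with. This sketch is an rpart analogue, not a call to `prune.rxDTree` or `rxDTreeBestCp`: it grows a deliberately large tree, picks the complexity parameter with the lowest cross-validated error (one common choice, assumed here), and prunes back to it.

```r
library(rpart)
set.seed(42)  # cross-validation folds are random

# Grow a deliberately large tree by disabling the complexity penalty
big <- rpart(
  Kyphosis ~ Age + Number + Start,
  data    = kyphosis,
  method  = "class",
  control = rpart.control(cp = 0, minsplit = 5, minbucket = 2)
)

# The cp table holds the cross-validated error ("xerror") for each subtree size
cp_table <- big$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune back to the subtree selected by that cp value
pruned <- prune(big, cp = best_cp)

cat("nodes before pruning:", nrow(big$frame),
    " after pruning:", nrow(pruned$frame), "\n")
```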
  • 35. Sample code for Decision Trees on workstation
      # Specify local data source
      airData <- myLocalDataSource
      # Specify model formula and parameters
      rxDTree(ArrDelay ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + CRSDepTime, data = airData)
  • 36. Sample code for Decision Trees on Hadoop
      # Change the "compute context"
      rxSetComputeContext(myHadoopCluster)
      # Change the data source if necessary
      airData <- myHadoopDataSource
      # Otherwise, the code is the same
      rxDTree(ArrDelay ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + CRSDepTime, data = airData)
  • 37. Write Once → Deploy Anywhere
    – Set the desired compute context for data analytics execution:
      rxSetComputeContext("local") # DEFAULT (local system)
      rxSetComputeContext(RxHadoopMR(<data, server environment arguments>))
      rxSetComputeContext(RxHpcServer(<data, server environment arguments>))
      rxSetComputeContext(RxLsfCluster(<data, server environment arguments>))
      rxSetComputeContext(RxInTeradata(<data, server environment arguments>))
    – Same code to be run anywhere:
      # Summarize and calculate descriptive statistics
      adsSummary <- rxSummary(~ArrDelay+CRSDepTime+DayOfWeek, data = airDS)
      # Fit Linear Model
      arrDelayLm1 <- rxLinMod(ArrDelay ~ DayOfWeek, data = airDS)
  • 38. Polling Question:  On which platforms are you most interested in running tree models? – Please select all that apply • Server • Grid • Hadoop • Teradata • Other
  • 39. Revolution R Enterprise ScaleR: High Performance Big Data Analytics – Data Prep, Distillation & Descriptive Analytics
    – R Data Step:  Data import – Delimited, Fixed, SAS, SPSS, ODBC  Variable creation & transformation using any R functions and packages  Recode variables  Factor variables  Missing value handling  Sort  Merge  Split  Aggregate by category (means, sums)
    – Descriptive Statistics:  Min / Max  Mean  Median (approx.)  Quantiles (approx.)  Standard Deviation  Variance  Correlation  Covariance  Sum of Squares (cross product matrix for set variables)  Pairwise Cross tabs  Risk Ratio & Odds Ratio  Cross-Tabulation of Data (standard tables & long form)  Marginal Summaries of Cross Tabulations
    – Statistical Tests:  Chi Square Test  Kendall Rank Correlation  Fisher’s Exact Test  Student’s t-Test
    – Sampling:  Subsample (observations & variables)  Random Sampling
  • 40. Revolution R Enterprise ScaleR (continued) – Statistical Modeling and Machine Learning
    – Predictive Models:  Covariance/Correlation/Sum of Squares/Cross-product Matrix  Multiple Linear Regression  Logistic Regression  Generalized Linear Models (GLM) - All exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions including: cauchit, identity, log, logit, probit. - User defined distributions & link functions.  Classification & Regression Trees and Forests  Residuals for all models
    – Data Visualization:  Histogram  ROC Curves (actual data and predicted values)  Lorenz Curve  Line and Scatter Plots  NEW Tree Visualization
    – Variable Selection:  Stepwise Regression: Linear, NEW logistic, NEW GLM
    – Simulation and HPC:  Monte Carlo  Run open source R functions and packages across cores and nodes
    – Cluster Analysis:  K-Means
    – Classification:  Decision Trees  NEW Decision Forests
    – Deployment:  Prediction (scoring)  NEW PMML Export
  • 41. Resources For You  Big Data Decision Trees with R – http://www.revolutionanalytics.com/whitepaper/big-data-decision-trees-r  Advanced, Big Data Analytics with R and Hadoop – http://www.revolutionanalytics.com/whitepaper/advanced-big-data-analytics-r-and-hadoop  Revolution R Enterprise: Faster than SAS – http://www.revolutionanalytics.com/whitepaper/revolution-r-enterprise-faster-sas  May 13, 2014 – Webinar presenting the results of our RRE vs. SAS benchmarking  To get RRE for yourself, please visit: – http://www.revolutionanalytics.com/get-revolution-r-enterprise
  • 43. Thank you – Revolution Analytics is the leading commercial provider of software and support for the popular open source R statistics language. mario.inchiosa@revolutionanalytics.com, www.revolutionanalytics.com, 1.855.GET.REVO, Twitter: @RevolutionR