Advanced Analytics in Hadoop

Thomas W. Dinsmore

• Use cases
• Architectures
• Current Options:
• Open Source
• Commercial

Analytics
[Figure: categories of analytics plotted against Computational Complexity]
• Ad Hoc Queries
• Reports
• Data Access
• Visualization
• Data Manipulation
• OLAP/ROLAP etc.
• Advanced Discovery
• Predictive Analytics
• Optimization
• Simulation
• Text Analytics
• Geospatial Analytics
• Econometrics
• Dashboards
• Scorecards
• Streaming Analytics

Advanced Analytics
[Figure: advanced analytics categories plotted against Computational Complexity]
• Advanced Discovery
• Optimization
• Simulation
• Text Analytics
• Econometrics
• Streaming Analytics

Advanced Analytics
• Advanced Discovery
• Optimization
• Simulation
• Text Analytics
• Econometrics
• Streaming Analytics
• Feature Extraction
• Dimension Reduction

For some use cases, you must use all of the data.
• Anomaly Detection
• Affinity Analysis
• Clustering
• Social Network Analysis
• Collaborative Filtering
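
One way to see why a use case such as anomaly detection resists sampling is a little arithmetic. The sketch below is not from the slides: the population size, the number of anomalies, and the 1% sample are hypothetical numbers, and `prob_sample_misses_all` is a made-up helper name. It computes the chance that a simple random sample, drawn without replacement, contains none of the rare records.

```python
def prob_sample_misses_all(population: int, rare: int, sample: int) -> float:
    """Probability that a simple random sample (without replacement)
    contains none of the `rare` records (hypergeometric, zero hits)."""
    p = 1.0
    for i in range(rare):
        # Chance that the i-th rare record falls outside the sample,
        # given that the previous rare records also fell outside it.
        p *= (population - sample - i) / (population - i)
    return p


# Hypothetical numbers: 1,000,000 records, 5 anomalies, a 1% sample.
p_miss = prob_sample_misses_all(1_000_000, 5, 10_000)
print(f"P(sample contains no anomalies) = {p_miss:.3f}")
```

Under these assumed numbers the sample misses every anomaly about 95% of the time, which is one reading of why such use cases need the full dataset.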

For others, using all of the data is worth it.
• Catastrophic Risk Modeling
• Modeling with Fine-grained Behavioral Data

Your Options (2013)
1. Apache Mahout
2. Code it yourself.
3. …

Legacy Alongside
[Diagram labels: HDFS nodes; Data]

Legacy Pass-Through
[Diagram labels: MapReduce; Data]

MapReduce Push-Down
[Diagram label: MapReduce]
Advantages
• Co-exists with other applications
• Integrated workload management
• Simplified administration
Disadvantages
• MapReduce latency
Co-Located In-Memory (Asymmetric)
[Diagram labels: YARN; cluster nodes, each running HDFS and MapReduce]
Advantages
• Easy to adapt legacy apps
• Isolates analytic workload
Disadvantages
• Data moves within the cluster
• Requires YARN

Co-Located In-Memory (Symmetric)
[Diagram labels: YARN; cluster nodes, each running HDFS and MapReduce]
Advantages
• Lowest latency
Disadvantages
• Upgrade every node
• Requires YARN

Summary: Architecture
• MapReduce Push-Down is current “champion”
• Stable
• Co-exists well with Hadoop ecosystem
• MR 1.0 penalizes performance
• Required: persistent in-memory processing
• YARN enables co-location
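
One reading of the performance penalty, and of the call for persistent in-memory processing, is that an iterative algorithm run as one job per iteration goes back to storage on every pass. The toy sketch below counts those passes. It is a simulation in plain Python, not Hadoop or Spark code: `CountingStore`, the data values, and the iteration count are all hypothetical.

```python
class CountingStore:
    """Stand-in for stored data that counts full scans of the dataset."""

    def __init__(self, values):
        self._values = list(values)
        self.scans = 0

    def scan(self):
        self.scans += 1
        return list(self._values)


def mean_by_gradient_descent(load, iterations=20, rate=0.5):
    """Iteratively estimate the mean; `load` supplies the data each pass."""
    estimate = 0.0
    for _ in range(iterations):
        data = load()
        gradient = sum(estimate - v for v in data) / len(data)
        estimate -= rate * gradient
    return estimate


values = [1.0, 2.0, 3.0, 4.0]

# One job per iteration: every pass goes back to storage.
disk_style = CountingStore(values)
est_disk = mean_by_gradient_descent(disk_style.scan)

# Persistent in-memory: scan once, then reuse the cached copy.
memory_style = CountingStore(values)
cached = memory_style.scan()
est_mem = mean_by_gradient_descent(lambda: cached)

print(disk_style.scans, memory_style.scans)  # 20 1
```

Both runs reach the same estimate. The difference is twenty scans of storage versus one, and on a real cluster each extra scan also carries job startup cost.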

Apache Mahout
• Apache incubator project (2007)
• Machine learning library
• Included in most distributions
• Thin acceptance, few contributors
• Diverse architecture
• Single-node
• MapReduce
• New algos run on Spark
• Recently cleaned up

Apache Giraph
• Apache top-level project
• Runs in MapReduce
• Dedicated graph engine
• Used by Facebook, few others
• Dead in the water
• No presence in leading distros
• No significant commercial support
• No releases in 13 months
• No recent code commits on Git

GraphLab
• Carnegie Mellon project (2009)
• Distributed in-memory engine:
• Primarily graph analysis
• Selected machine learning algos
• Interface from Java, JavaScript, Python
• GraphLab Inc provides commercial support (2013, $6.75MM)
• Independent distribution, or through Pivotal

0xdata H2O
• Vendor-driven open source project
• 0xdata sells support, customization
• Distributed in-memory prediction engine
• Multiple deployment options:
• Standalone (with HDFS)
• Over YARN
• In MapReduce
• Claims 2,000+ users
• 4 public references
• Used by a leading P&C insurer
• Java, R, Python and Scala interfaces

Apache Spark
• Top-level Apache project (February 2014)
• Release 1.0 (May 2014)
• Distributed in-memory analytics
• Machine learning
• Graph analytics
• Streaming analytics
• Fast SQL
• Compatible with Hadoop storage
• Integrated with YARN
• Scala, Python, Java interfaces (+SparkR)
• Growing ecosystem
• Supported in leading Hadoop distributions

Apache Spark: Hadoop Distributions
Spark Components: MLlib | GraphX | Spark Streaming | Spark SQL | Shark
Cloudera Yes Yes Yes Yes (Impala)
Hortonworks Yes (Storm) (Stinger)
MapR Yes Yes Yes Yes Yes
Pivotal Yes Yes Yes Yes Yes
IBM BigInsights

Summary: Open Source Projects
Project | 0xdata H2O 2.2 | Apache Giraph 1.1 | Apache Mahout 0.9 | Apache Spark 1.0 | GraphLab 2.2
Status | Independent | Top-Level | Top-Level | Top-Level | Independent
Architecture | Co-Located Memory-Centric | MapReduce | MapReduce | Co-Located Memory-Centric | Co-Located Memory-Centric
Interfaces | Java, Python, R, Scala | Java | Java | Java, Python, Scala (SparkR) | Python
Commercial Support | 0xdata | | | Databricks | GraphLab, Inc.
Distribution | Independent | Independent | All Hadoop Distributions | Cloudera, Hortonworks, MapR, Pivotal | Independent

Analytic Features
Columns: 0xdata H2O 2.2 | Apache Giraph 1.1 | Apache Mahout 0.9 | Apache Spark 1.0 | GraphLab 2.2
Prediction +++ + +++
Dimension Reduction + +++ + +
Clustering + +++ + +++
Collaborative Filtering +++ + +++
Text Analytics +++ +++
Matrix Operations + +++ +
Graph Analysis + + +++

Analytic Features: Prediction
Mahout 0.9 Spark 1.0 H2O 2.2
Linear Regression +
Logistic Regression +
Generalized Linear Models +
Naive Bayes + + +
Decision Tree +
Gradient Boosted Trees +
Random Forests + +
Linear Support Vector Machine +
Deep Learning (Backprop MLP) +

Analytic Features: Dimension Reduction
Mahout 0.9 Spark 1.0 H2O 2.2 GraphLab 2.2
Singular Value Decomposition + +
Lanczos Algorithm + +
Stochastic SVD +
Principal Components Analysis + + +

Analytic Features: Clustering
k-Means + + + +
Fuzzy k-Means +
Streaming k-Means +
Spectral Clustering + +

Analytic Features: Collaborative Filtering
Item-Based +
Matrix Factorization with ALS + + +
Matrix Factorization with ALS, Implicit Feedback +
ALS with Parallel Coordinate Descent +
Weighted ALS +
Sparse ALS +

Analytic Features: Text Analytics
Latent Dirichlet Allocation + +
Frequent Pattern Mining +
Collocations +

Matrix Operations
Stochastic Gradient Descent + +
Limited-Memory BFGS +
RowSimilarityJob +
ConcatMatrices +

Summary: Open Source
• Giraph is toast
• Mahout may be recovering from roadkill status
• GraphLab outperforms Spark GraphX today in graph analytics
• 0xdata H2O outperforms Spark MLlib today in machine learning
• Spark catching up fast
• More resources and distribution
• Integrated platform for ML and graph analysis

Alpine
• Business user interface
• Collaboration environment
• Broad library of techniques
• Strong cloud offering
• Leverages Hadoop (multiple distros), Hawq or Pivotal Greenplum
• Push-down MapReduce
• Certified on Spark
• Small but growing customer base

IBM SPSS Analytics Server
• Introduced 2013
• Serves as “back end” for SPSS Modeler
• Uses push-down MR
• Limited analytic feature set
• IBM supports on multiple Hadoop distros
• Customer acceptance unknown

Revolution Analytics ScaleR
• ScaleR library of distributed statistics, machine learning functions
• Tools to distribute arbitrary R functions
• Runs in Cloudera, Hortonworks, Teradata, LSF clusters, MS HPC
• Hadoop edition uses MR push-down
• Tools simplify installation in large clusters
• R interface
• Partnerships with Alteryx, Qlik, MicroStrategy, Tableau provide business interfaces

Skytree Server
• Georgia Tech’s FastLab project, repurposed as commercial software
• Distributed machine learning platform
• Very opaque about technical details
• User interface is an API
• Co-located in Hadoop under YARN
• Just certified by Hortonworks
• No new public references in a year
• Used by leading credit card company

SAS High-Performance Analytics
• Distributed in-memory analytics
• Designed to run in special-purpose appliances (2011)
• Repurposed to run in Hadoop (2013)
• Co-exists poorly: cannot run SAS and MapReduce at the same time
• Reads entire dataset into memory
• Uses MPI to communicate among nodes
• Requires upgrades from standard Hadoop infrastructure
• No public references
• Generic success stories missing from Strata presentations

SAS LASR Server
• SAS’ “other” distributed in-memory platform
• Back end for several end-user products
• SAS Visual Analytics (2012)
• SAS Visual Statistics (New)
• SAS In-Memory Statistics for Hadoop (New)
• Recently added statistics and machine learning
• Does not read raw HDFS; must be transformed to proprietary SASHDAT
• Like HPA, reads entire dataset into memory.
• 16 Core 256GB node can load 75GB table
• Runs DS2 programs, not Legacy SAS programs
• Fast, but with limited feature set
• SAS claims 1,400 “sites” for Visual Analytics
• Many of those are standalone boxes
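
The sizing figure above lends itself to a back-of-the-envelope calculation. The sketch below takes the 256GB and 75GB numbers from the slide. Everything else is an assumption for illustration: in particular, that loadable table size scales linearly with node count, which the slide does not state, and the helper names, which are hypothetical.

```python
import math

NODE_RAM_GB = 256        # from the slide
LOADABLE_TABLE_GB = 75   # from the slide


def ram_per_table_gb() -> float:
    """RAM provisioned per GB of table loaded, per the slide's figures."""
    return NODE_RAM_GB / LOADABLE_TABLE_GB


def nodes_needed(table_gb: float) -> int:
    """Nodes required, ASSUMING capacity scales linearly with node count."""
    return math.ceil(table_gb / LOADABLE_TABLE_GB)


print(round(ram_per_table_gb(), 2))  # 3.41
print(nodes_needed(1000))            # 14 nodes for a hypothetical 1TB table
```

On these figures, each gigabyte of table needs about 3.4GB of RAM, so an engine that reads the entire dataset into memory puts a hard ceiling on table size per node.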

Summary: Commercial
• Alpine’s interface is compelling to business user
• IBM Analytics Server is a good first release
• RRE ScaleR appeals to R users, plays well in Hadoop sandbox
• Skytree Server: strong in prediction
• SAS: why two competing memory-centric architectures?

Progress
• Spark: blindingly fast maturity
• Rapidly expanding library of analytic features
• Growing developer community, ecosystem
• Commercial: from zero to many

Interesting Questions
• Will Mahout get a second wind?
• Will Spark MLlib displace 0xdata?
• Will Spark GraphX catch up to GraphLab?
• Can Spark Streaming compete with Storm and commercial entrants?
• How quickly will customers adopt memory-centric architecture for analytics?
• What will Alpine and MicroStrategy do with Spark?
• Will IBM distribute Spark in BigInsights?
• When will SAS announce a reference customer for HPA/LASR in Hadoop?

Thomas W. Dinsmore