This document discusses parallel external memory algorithms (PEMAs) and their application to generalized linear models (GLMs). PEMAs are external memory algorithms that have been parallelized so they can run on multiple cores and computers. The document describes arranging GLM code into four functions - Initialize, ProcessData, UpdateResults, and ProcessResults - to create a PEMA. It also describes a C++ and R implementation of GLM built on this approach that uses multiple cores and nodes efficiently for very high performance on large data sets. Benchmark results demonstrate linear scaling of this implementation with the number of rows and approximately linear scaling with the number of nodes.
Parallel External Memory Algorithms Applied to Generalized Linear Models
1. Parallel External Memory Algorithms Applied to Generalized Linear Models
Lee E. Edlefsen, Ph.D.
Chief Scientist
JSM 2012
2. Introduction and overview
For the past several decades, the rising tide of technology has allowed the same data analysis code to handle the increase in sizes of typical data sets. That era is ending: the size of data sets is increasing much more rapidly than the speed of single cores, of RAM, and of hard drives.
To deal with this, statistical software must be able to use multiple cores and computers. Parallel external memory algorithms (PEMAs) provide a foundation for such software.
3. Introduction and overview – (2)
External memory algorithms (EMAs) are those that do not require all data to be in RAM; such algorithms are widely available.
Parallel implementations of EMAs allow them to run on multiple cores and computers, and to process an unlimited number of rows of data.
This paper describes a general approach to efficiently parallelizing EMAs, using an R and C++ implementation of generalized linear models (GLM) as a detailed example.
4. Introduction and overview – (3)
This paper discusses:
the arrangement of code for "automatic" parallelization
the efficient use of cores
the efficient use of multiple computers (nodes)
The approach presented is independent of the distributed computing platform (MPI, Hadoop, MPP database appliances).
The paper includes billion-row benchmarks showing linear scaling with rows and nodes, demonstrating that extremely high performance is achievable.
5. High Performance Computing vs. High Performance Analytics
HPA is HPC + Data.
High Performance Computing is CPU centric: lots of processing on small amounts of data, with the focus on cores.
High Performance Analytics is data centric: less processing per unit of data, with the focus on feeding data to the cores - on disk I/O and data locality, and on efficient threading and data management in RAM.
6. High Performance Analytics in RevoScaleR
Extremely high performance data management and data analysis.
Scales from small local data to huge distributed data, and from laptop to cluster to cloud.
Based on a platform that "automatically" and efficiently parallelizes and distributes a broad class of predictive analytic algorithms.
This platform implements the approach to parallel external memory algorithms I will describe.
7. External memory algorithms
External memory algorithms are those that allow computations to be split into pieces so that not all data has to be in memory at one time.
Such algorithms process data a "chunk" at a time, storing intermediate results from each chunk and combining them at the end.
Each chunk must produce an intermediate result that can be combined with other intermediate results to give the final result.
Such algorithms are widely available for data management and predictive analytics.
8. Parallel external memory algorithms (PEMAs)
PEMAs are external memory algorithms that have been parallelized.
Such algorithms process data a chunk at a time in parallel, storing intermediate results from each chunk and combining them at the end.
External memory algorithms that are not "inherently sequential" can be parallelized: the results for one chunk of data cannot depend upon prior results, although data dependence (lags, leads) is OK.
9. Generalized Linear Models (GLM)
The generalized linear model can be thought of as a generalization of linear regression.
It extends linear regression to handle dependent variables generated from exponential-family distributions, including the Gaussian, Poisson, logistic (binomial), gamma, multinomial, and Tweedie cases.
Generalized linear models are widely used in a variety of fields and industries.
10. GLM overview
The dependent variable Y is generated from a distribution in the exponential family.
The expected value of Y is related to a linear predictor of the data X and parameters β through the inverse of a "link" function g():
E(Y) = mu = g⁻¹(Xβ)
The variance of Y is typically a function V() of the mean mu:
Var(Y) = varmu = V(mu)
11. GLM Estimation
The parameters of GLM models can be estimated using maximum likelihood.
Iteratively reweighted least squares (IRLS) is commonly used to obtain the maximum likelihood estimates.
Each iteration of IRLS requires at least one pass through the data, generating a vector of weights and a "new" dependent variable, and then doing a weighted least squares regression.
12. IRLS for GLM
Given an estimate of the parameters β and the data X, IRLS requires the computation of a "weight" variable W and a "new" dependent variable Z:
eta = Xβ
mu = linkinv(eta)
Z = (y - mu)/mu_eta, where mu_eta is the partial derivative of mu with respect to eta
W = sqrt(mu_eta*mu_eta/varmu)
The next β is then computed by regressing Z on X, weighted by W.
If the estimation has not converged, the steps are repeated.
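A minimal in-memory sketch of one IRLS iteration in R, using R's built-in family objects (binomial(), poisson(), etc.); variable names follow the slide, and lm.wfit stands in for the weighted least squares step:

```r
## One IRLS iteration for a GLM, in memory. Assumes X (n x p model
## matrix), y (response vector), beta (current estimate), and an R
## family object such as binomial().
irls_step <- function(X, y, beta, family) {
  eta    <- drop(X %*% beta)          # linear predictor
  mu     <- family$linkinv(eta)       # inverse link
  mu_eta <- family$mu.eta(eta)        # partial of mu with respect to eta
  varmu  <- family$variance(mu)       # V(mu)
  Z <- (y - mu) / mu_eta              # "new" dependent variable
  W <- sqrt(mu_eta * mu_eta / varmu)  # square-root weight
  ## Because Z here is in residual form, the weighted regression yields
  ## the increment to beta rather than beta itself.
  fit <- lm.wfit(X, Z, W^2)           # regression weights are W squared
  beta + fit$coefficients
}
```

Iterating irls_step() to convergence should reproduce the coefficients that glm.fit() computes for the same family.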
13. In-memory implementations
The glm() function in R provides a beautiful and efficient in-memory implementation.
However, nearly every computational line of code involves processing all rows of data.
There is no easy way to directly convert an implementation like this into one that can handle data too big to fit into memory and that can use multiple cores and multiple computers.
It can, however, be accomplished by arranging the same computations into separate functions that carry out separate tasks.
14. Example external memory algorithm for the mean of a variable
Initialization function: total = 0, count = 0
ProcessData function: for each chunk of x: total = sum(x), count = length(x)
UpdateResults function: total12 = total1 + total2, count12 = count1 + count2
ProcessResults function: mean = combined total / combined count
A runnable version in R is sketched below.
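The same algorithm as runnable R functions (a sketch; the intermediate result, or IR, is just a list holding a running total and count):

```r
## PEMA for the mean of a variable. Each chunk produces its own IR;
## IRs combine in any order, so chunks can be processed in parallel.
Initialize     <- function() list(total = 0, count = 0)
ProcessData    <- function(ir, x)
  list(total = ir$total + sum(x), count = ir$count + length(x))
UpdateResults  <- function(ir1, ir2)
  list(total = ir1$total + ir2$total, count = ir1$count + ir2$count)
ProcessResults <- function(ir) ir$total / ir$count

## Example: 100 chunks of 10,000 values each
chunks <- split(rnorm(1e6), rep(1:100, each = 1e4))
irs    <- lapply(chunks, function(x) ProcessData(Initialize(), x))
ProcessResults(Reduce(UpdateResults, irs))   # equals the overall mean
```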
15. A formalization of PEMAs
Arrange the code into 4 functions:
1. Initialize(): does any necessary initialization.
2. ProcessData(): takes a chunk of data and produces an intermediate result (IR). This is the only function run in parallel; it must assume it does not have all the data, and it must produce no side effects.
3. UpdateResults(): takes two IRs and produces another IR that is equivalent to the IR that would have been produced by combining the two corresponding chunks of data and calling ProcessData().
4. ProcessResults(): takes any given IR and converts it into a "final results" (FR) form.
16. An external memory algorithm for GLM
Initialization function: set intermediate values to 0.
ProcessData function: for a given β and chunk of data X, compute Z, W, and M, the weighted cross-products matrix of X and Z for this chunk:
eta = Xβ, mu = linkinv(eta)
Z = (y - mu)/mu_eta, W = sqrt(mu_eta*mu_eta/varmu)
M = [X*W Z*W]'[X*W Z*W]
UpdateResults function: M12 = M1 + M2
ProcessResults function: β = Solve(M) (solves a set of linear equations)
Check for convergence and repeat if necessary.
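A sketch of the ProcessData step for one chunk in R (names follow the slides; the chunk's IR is the matrix M, and UpdateResults is simply matrix addition):

```r
## ProcessData for one chunk of a GLM. Returns the weighted
## cross-products matrix M for the chunk; per-chunk Ms are combined
## by addition (UpdateResults).
glm_process_chunk <- function(X, y, beta, family) {
  eta    <- drop(X %*% beta)
  mu     <- family$linkinv(eta)
  mu_eta <- family$mu.eta(eta)
  W      <- sqrt(mu_eta^2 / family$variance(mu))
  Z      <- (y - mu) / mu_eta
  XZ <- cbind(X, Z) * W     # row-scale [X Z] by the square-root weight
  crossprod(XZ)             # M = [X*W Z*W]' [X*W Z*W]
}
```

Given the combined M and p = ncol(X), ProcessResults amounts to solving the normal equations, e.g. solve(M[1:p, 1:p], M[1:p, p + 1]), which yields the update to β.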
17. A C++ and R implementation of GLM
C++ "analysis" objects:
Have the 4 virtual PEMA methods, among others
Have member variables for intermediate results and for maintaining local state
Know how to copy themselves (including the ability to not copy some members, for efficiency)
Have the ability to call into R during ProcessData()
R "family" objects for glm:
Contain methods for computing Z, W (eta, mu, etc.)
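For reference, R's built-in family objects already carry the pieces that ProcessData needs:

```r
fam <- binomial()    # family object for logistic regression
fam$linkfun(0.5)     # link function:  log(0.5 / (1 - 0.5)) = 0
fam$linkinv(0)       # inverse link:   1 / (1 + exp(0))     = 0.5
fam$mu.eta(0)        # d mu / d eta at eta = 0:               0.25
fam$variance(0.5)    # V(mu) = mu * (1 - mu):                 0.25
```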
18. GLM in C++ and R: Multiple Cores
On each computer, a master analysis object makes a copy of itself for all usable threads (cores) except one.
The remaining thread is assigned to handle all I/O.
In a master loop over the data, the I/O object reads a chunk of data.
In parallel (after the first read), portions of the previously read chunk are (virtually) passed to the ProcessData() methods of the other objects.
19. GLM in C++ and R: Multiple Cores – (2)
For each chunk of data, Z and W are computed (in R or C++; if in R, only one thread at a time is allowed); Xβ and M are computed in C++.
After all data has been consumed, the master analysis object loops over all of the thread-specific objects and updates itself (using UpdateResults()), producing the intermediate results object that corresponds to all of the data processed on this computer.
If other computers are being used, this computer sends its intermediate results to the "master" node.
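The production implementation is threaded C++, but the per-computer control flow can be emulated in R with the parallel package (a simplified sketch; it omits the dedicated, pipelined I/O thread, and glm_process_chunk is the sketch above):

```r
library(parallel)

## One pass over the data on a single computer: process chunks on
## several cores, then fold the per-chunk IRs into this node's IR.
## `chunks` is a list of blocks, each with components X and y.
one_pass <- function(chunks, beta, family, cores = 4) {
  irs <- mclapply(chunks,
                  function(ch) glm_process_chunk(ch$X, ch$y, beta, family),
                  mc.cores = cores)
  Reduce(`+`, irs)    # UpdateResults for GLM is matrix addition
}
```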
20. GLM in C++ and R: Multiple MPI Nodes
A "master node" sends a copy of the analysis object, or instructions on how to create one, to each computer (node) on a cluster or grid, and the steps described above are carried out.
Each node reads and processes its portion of the data (the more local the data, the better).
Worker nodes do not communicate with each other.
Worker nodes do not communicate with the master node except to send their results.
21. GLM in C++ and R: Multiple MPI Nodes – (2)
When each node has its final IR object, it sends it to the master node.
The master node gathers and combines all intermediate results using UpdateResults().
When it has the final intermediate results, it calls ProcessResults() to get the next estimate of β.
The master node checks for convergence, and repeats all of the steps if necessary.
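Putting the pieces together, the master node's outer IRLS loop looks roughly like this (a sketch; gather_node_results() is a hypothetical stand-in for the MPI broadcast/gather step):

```r
## Master-node IRLS driver. Each iteration broadcasts beta, gathers one
## IR (an M matrix) per node, combines them, and re-solves.
glm_master <- function(nodes, p, family, tol = 1e-8, maxit = 25) {
  beta <- rep(0, p)
  for (it in seq_len(maxit)) {
    irs   <- gather_node_results(nodes, beta, family)  # hypothetical MPI step
    M     <- Reduce(`+`, irs)                          # UpdateResults
    delta <- solve(M[1:p, 1:p], M[1:p, p + 1])         # ProcessResults
    beta  <- beta + delta
    if (sum(delta^2) < tol) break                      # convergence check
  }
  beta
}
```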
22. Implementation in RevoScaleR
The package RevoScaleR, which is part of Revolution R Enterprise, contains an implementation of GLM and other algorithms based on this approach.
The algorithms are internally threaded.
They can currently use MPI or RPC for inter-process communication.
Platform LSF and HPC Server schedulers are supported.
We are currently working on supporting Hadoop.
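From the user's side, a fit looks like an ordinary R call (a sketch; the .xdf file and variable names here are hypothetical):

```r
## Fit a logistic GLM on an on-disk .xdf data set with RevoScaleR.
fit <- rxGlm(default ~ creditScore + yearsEmploy + ccDebt,
             data = "mortgages.xdf", family = binomial())
summary(fit)
```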
23. Some features of this implementation
Handles an arbitrarily large number of rows in a fixed amount of memory.
Scales linearly with the number of rows.
Scales (approximately) linearly with the number of nodes.
Scales well with the number of cores per node and with the number of parameters.
Works on commodity hardware.
Extremely high performance.
24. Scalability of linear regression with rows
1 million to 1 billion rows, 443 betas (4 cores)
[Figure: elapsed time (secs) versus rows (millions); timing is linear in the number of rows, at roughly 1.1 million rows per second]
25. Scalability of glm (logit) with rows
1 million to 1 billion rows, 443 betas (4 cores)
[Figure: elapsed time (secs) versus rows (millions); timing is again linear in the number of rows]
26. Scalability with nodes: glm (logit)
Big (1B rows) and small (124M rows) data; big (443 params) and small (7 params) models; 4 cores per node; 5 iterations per model
[Figure: elapsed time versus number of nodes for the four data/model combinations, against a linear-scaling reference line; Big Data, Big Model shows super-linear scaling]
27. Timing comparisons
glm() in CRAN R vs. rxGlm in RevoScaleR
SAS's new HPA functionality vs. rxGlm
29. HPA Benchmarking comparison* – Logistic Regression

                   SAS HPA         RevoScaleR rxGlm
Rows of data       1 billion       1 billion
Parameters         "just a few"    7
Time               80 seconds      44 seconds
Data location      In memory       On disk
Nodes              32              5
Cores              384             20
RAM                1,536 GB        80 GB

Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
*As published by SAS in HPC Wire, April 21, 2011
30. Conclusion
PEMAs provide a systematic approach to scalable analytic algorithms.
Algorithms implemented in this way can handle an unlimited number of rows on a single core in a fixed amount of RAM.
Such algorithms scale well with rows and nodes, and scale well with cores up to a point.
They work on commodity hardware and on different distributed computing platforms.
Extremely high performance is possible.
31. Thank you!
R-Core Team
R Package Developers
R Community
Revolution R Enterprise Customers and Beta Testers
Colleagues at Revolution Analytics
Contact: lee@revolutionanalytics.com