A session on Microsoft's Big Data Analytics solution. It is Hortonworks (Hadoop, HBase, Storm, Spark) together with the high-performance R Server. Advanced analytics using RevoScaleR.
2. Łukasz Grala
• Senior architect of Data Platform, Business Intelligence and Advanced Analytics solutions at TIDK
• Creator of “Data Scientist as a Service”
• Microsoft Certified Trainer and university lecturer
• Author of advanced training courses and workshops, as well as numerous publications and webcasts
• Recipient of the Microsoft Data Platform MVP award since 2010
• PhD student at Politechnika Poznańska (Poznań University of Technology), Faculty of Computing (databases, data mining, machine learning)
• Speaker at numerous conferences in Poland and around the world
• Holds numerous certifications (MCT, MCSE, MCSA, MCITP, …)
• Member of Polskie Towarzystwo Informatyczne (Polish Information Processing Society)
• Member and leader of the Polish SQL Server User Group (PLSSUG)
• Passionate about data analysis, storage and processing; jazz enthusiast
email lukasz@tidk.pl
3. Big Data – 4V
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
8. HDFS: WordCount in MapReduce
Input: constitution.txt, stored in HDFS (“We the people, in order to form a...”)
1. The mappers read the file’s blocks from HDFS line by line.
2. The lines of text are split into words and output to the reducers: <We,1> <the,1> <people,1> <in,1> <order,1> <to,1> <form,1> <a,1>
3. The shuffle/sort phase combines pairs with the same key: <We,(1,1,1,1)> <the,(1,1,1,1,1,1,1,...)> <people,(1,1,1,1,1)> <form,(1)>
4. The reducers add up the “1’s” and output the word and its count to HDFS: <We,4> <the,265> <people,5> <form,1>
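The map, shuffle/sort and reduce steps of this slide can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop's API: the function names `map_phase`, `shuffle_phase` and `reduce_phase` and the three sample lines are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(lines):
    """Mapper: split each line into words and emit a <word, 1> pair per word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle/sort: group the emitted values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reducer: add up the 1's collected for each word."""
    return {word: sum(ones) for word, ones in groups.items()}


# Toy stand-in for the lines of constitution.txt (hypothetical sample, not the real file).
lines = ["We the people", "the people form", "We form the"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)
```

In a real cluster each phase runs on many nodes at once; here the three functions are simply chained so the data flow between them is visible.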
9. What is Apache Ambari?
A completely open source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters.
Apache Ambari takes the guesswork out of operating Hadoop.
10. Spark’s Position in a Modern Data Platform
[Figure: three columns]
Data Sources: Disk Based Source, Streaming Source, Reference Data, Ad Hoc/On Demand Source
Data Processing, Storage & Analytics:
• Stream Processing: Storm/Spark-Streaming
• Data Pipeline: Hive/Pig/Spark
• Long Term Data Warehouse: Hive + ORC
• Data Science: Spark-ML, Spark-SQL
Data Access: Data Discovery, Operational Reporting, Business Intelligence, Advanced Analytics
11. Spark Context
What is it?
• Main entry point for Spark functionality
• Represents a connection to a Spark cluster
• Represented as sc in your code
14. HDInsight
• HDInsight is a Hadoop-based service that brings a 100% Apache Hadoop solution to the Microsoft Azure platform
• Based on the Hortonworks Data Platform (HDP)
• Scalable, on-demand service
15. CRAN: 7000+ add-on packages for R
CRAN Task View by Barry Rowlingson: http://www.maths.lancs.ac.uk/~rowlings/R/TaskViews/
16. A brief history of R
1993 Research project in Auckland, NZ
• Ross Ihaka and Robert Gentleman
1995 Released as open-source software
• Generally compatible with the “S” language
1997 R core group formed
2003 R Foundation formed in Austria
2007 Revolution Analytics founded
2014 Revolution R Open launched
2015 R Consortium founded
2015 Microsoft acquires Revolution Analytics
2016 Microsoft R Open 3.2.3 released
Photo credit: Robert Gentleman
17. R: The #1 software for Data Science
… and #6 amongst general-purpose programming languages
R Usage Growth (Rexer Data Miner Survey, 2007-2015)
• 76% of analytic professionals report using R
• 36% select R as their primary tool
Language Popularity (IEEE Spectrum Top Programming Languages, 2015)
18. Use Microsoft R Open with…
• Microsoft R Server: Big-data analytics and distributed computing on Linux, Hadoop and Teradata
• SQL Server 2016: Big-data analytics integrated with SQL Server database
• PowerBI: Computations and charts from R scripts in dashboards
• Azure ML Studio: R scripts in cloud-based Experiment workflows
• Visual Studio: R Tools for Visual Studio, an integrated development environment for R
• HDInsight: R integrated with cloud-based Hadoop clusters
• Cortana Analytics: Cloud-based R APIs and Virtual Machines
19. The Microsoft R Server Platform
Components: R Open, Microsoft R Server, DeployR, DevelopR
ConnectR
• High-speed & direct connectors
• Available for:
  • High-performance XDF
  • SAS, SPSS, delimited & fixed-format text data files
  • Hadoop HDFS (text & XDF)
  • Teradata Database & Aster
  • EDWs and ADWs
  • ODBC
ScaleR
• Ready-to-use high-performance big data big analytics
• Fully-parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Range of predictive functions
• User tools for distributing customized R algorithms across nodes
• Wide data sets supported – thousands of variables
DistributedR
• Distributed computing framework
• Delivers cross-platform portability
R+CRAN
• Open source R interpreter
• R 3.2.5
• Freely-available huge range of R algorithms
• Algorithms callable by RevoR
• Embeddable in R scripts
• 100% compatible with existing R scripts, functions and packages
RevoR
• Performance enhanced R interpreter
• Based on open source R
• Adds high-performance math library to speed up linear algebra functions
20. R Packages: RHadoop and ParallelR
Toolkits for data scientists and numerical analysts to create custom parallel and distributed algorithms
• ParallelR: parallel programming for multi-CPU servers and grids
• RHadoop: map-reduce programming in R language
Mainly useful for “embarrassingly parallel” problems, where parallel components work with small amounts of data
Big Data Predictive Analytics is mostly not embarrassingly parallel
• 80+ pre-built “parallel external memory algorithms” included with RevoScaleR
• Azure ML Studio includes many ML algorithms
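As an illustration of what “embarrassingly parallel” means on this slide, the sketch below runs several independent Monte Carlo estimates of pi side by side. It is written in Python with the standard library rather than with ParallelR or RHadoop, and the function name `estimate_pi`, the seeds and the number of draws are assumptions made for the example. A thread pool is used only to keep the sketch short; CPU-bound work in CPython would normally use a process pool instead.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def estimate_pi(seed, draws=10_000):
    """One independent task: estimate pi from random points in the unit square."""
    rng = random.Random(seed)  # private generator, so tasks share no state
    hits = sum(
        1 for _ in range(draws)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * hits / draws


seeds = [1, 2, 3, 4]

# Each task needs only its own seed, so the tasks can run in any order or all at once.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(estimate_pi, seeds))

# Running the same tasks one after another gives identical results.
serial = [estimate_pi(seed) for seed in seeds]
print(parallel == serial)
```

No task depends on another task's output, which is what makes the problem trivially parallel. Fitting one model over a large data set does not split this cleanly, which is the slide's point about predictive analytics.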
21. ScaleR – Parallel + “Big Data”
Stream data into RAM in blocks. “Big Data” can be any data size. We handle megabytes to gigabytes to terabytes…
Our ScaleR algorithms work inside multiple cores / nodes in parallel at high speed.
Interim results are collected and combined analytically to produce the output on the entire data set.
The XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing.
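The idea of streaming blocks and combining interim results can be sketched as follows. This is a minimal Python illustration of the general block-wise pattern, not ScaleR's implementation: the helper names (`read_blocks`, `block_summary`, `combine`, `finalize`) and the sample data are assumptions, and a Python list stands in for a file read block by block.

```python
def read_blocks(values, block_size):
    """Yield the data one block at a time (stand-in for reading blocks from disk)."""
    for start in range(0, len(values), block_size):
        yield values[start:start + block_size]


def block_summary(block):
    """Interim result for one block: count, sum and sum of squares."""
    return len(block), sum(block), sum(x * x for x in block)


def combine(a, b):
    """Combine two interim results; order does not matter."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def finalize(summary):
    """Turn the combined interim result into the mean and sample variance."""
    n, total, total_sq = summary
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample values
summary = (0, 0.0, 0.0)
for block in read_blocks(data, block_size=3):
    summary = combine(summary, block_summary(block))

mean, variance = finalize(summary)
print(mean, variance)
```

Because `combine` is just addition, the blocks could equally be summarised on different cores or nodes and merged afterwards. The sum-of-squares formula is used here for brevity; it can lose precision on large data, where a more numerically stable update would be preferable.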
22. MRS and Hadoop Architecture options
[Figure: Microsoft R Server (ScaleR Production, RStudio Server Pro) next to a grid of R instances; three options for bringing data and computation together: 1. Copy, 2. Stream, 3. Send]
23. DistributedR - Hadoop Processing Methods
Method 1 (“Beside” or “Edge”): Local (Linux) parallel processing using all cores on one node, copying data from HDFS to store in the local Linux file system.
[Figure: compute context Local Parallel; processing on 1 edge node with 1:n disks, reading and writing Csv/Xdf files on the local Linux file system; data copied to local files from HDFS (Csv, Xdf) on 1:n data nodes with 1:(n x number of nodes) disks]
Method 2: Local (Linux) parallel processing using all cores on one node, streaming data from / to HDFS.
[Figure: compute context Local Parallel; processing on 1 edge node with 1:n disks; data (Csv, Xdf) stays in HDFS on 1:n nodes with 1:(n x number of nodes) disks]
24. Method 3
Method 3 (“inside”): Hadoop (Map-Reduce) parallel processing using all cores on n nodes, using HDFS data on each node. The R script is sent to the data nodes.
[Figure: compute context Hadoop; processing and data (Csv, Xdf) both on 1:n nodes with 1:(n x number of nodes) disks, reading and writing HDFS; 1 edge node]
R model script sent to Master Node:
1. Starts a master process
2. Distributes work
3. Master tasks for each node
4. Master initiates distributed work
   1. Hadoop schedules a mapper for each split
   2. Algorithm computes intermediate result
   3. Reducer combines intermediate results
5. Master process evaluates completion
6. Iterates as required by the algorithm
7. Returns consolidated answer to script
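The master loop described in these steps can be sketched in a single Python process. The slide does not name an algorithm, so one-dimensional k-means is used here purely as a stand-in iterative method; the function names `mapper`, `reducer` and `master`, the splits and the starting centers are all assumptions made for illustration and are not part of Microsoft R Server.

```python
def mapper(split, centers):
    """Intermediate result for one split: per-center [sum, count] of the nearest points."""
    partial = [[0.0, 0] for _ in centers]
    for x in split:
        nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        partial[nearest][0] += x
        partial[nearest][1] += 1
    return partial


def reducer(partials, centers):
    """Combine the intermediate results from all splits into updated centers."""
    new_centers = []
    for i, old in enumerate(centers):
        total = sum(p[i][0] for p in partials)
        count = sum(p[i][1] for p in partials)
        new_centers.append(total / count if count else old)
    return new_centers


def master(splits, centers, tolerance=1e-9, max_iterations=100):
    """Distribute work, check completion, and iterate until the centers stop moving."""
    for iteration in range(1, max_iterations + 1):
        partials = [mapper(split, centers) for split in splits]  # one mapper per split
        new_centers = reducer(partials, centers)
        converged = all(abs(a - b) <= tolerance for a, b in zip(new_centers, centers))
        centers = new_centers
        if converged:
            break
    return centers, iteration


# Three hypothetical splits, as if the data were stored in three HDFS blocks.
splits = [[1.0, 2.0, 9.0], [3.0, 10.0], [11.0, 2.0]]
centers, iterations = master(splits, centers=[0.0, 5.0])
print(centers, iterations)
```

Only the small per-split summaries travel to the reducer, never the raw points, which is why sending the script to the data scales better than copying the data to the script.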
25. DistributedR - What processing mode to use, when?
Analytic data set size and processing complexity (e.g. simple summary statistics vs iterative algorithm) guide the use of Method 1 and 2 (Edge Node / Server Linux local processing) vs Method 3 (in-Hadoop processing).
[Figure: matrix of Data Size (Small Data < 10GB, Medium Data < 50GB, Bigger Data > 50GB) against Processing Complexity (Low, Medium, High). Legend: Edge Node Linux processing vs In-Hadoop processing; local Linux file system vs Hadoop file system]
26. Parallelized Algorithms
Data Step
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models
Cluster Analysis
• K-Means
Classification
• Decision Trees
• Decision Forests
• Stochastic Gradient Boosted Decision Trees
Variable Selection
• Stepwise Regression: Linear, Logistic and GLM
Simulation
• Monte Carlo
• Parallel Random Number Generation
Combination
• Using Revolution rxDataStep and rxExec functions to combine open source R with Revolution R
• PEMA API
29. Performance Comparison
US flight data for 20 years
Linear Regression on Arrival Delay
Run on 4 core laptop, 16GB RAM and 500GB SSD
Microsoft R Server is not limited by the size of available RAM. When open source R operates on data sets that exceed RAM, it will fail. In contrast, Microsoft R Server scales linearly well beyond RAM limits, and its parallel algorithms are much faster.