Visualisation is essential to the data science process because it allows us to look at a portrait of our data and develop new hypotheses about our problem. However, visualisation does not scale very well, as we are limited by the number of pixels on our screen (at least for static graphics). This deck covers the Bin - Summarize - Smooth approach to visualising big data, which was developed by Hadley Wickham and implemented in the R package bigvis.
2. Visualise a Million Data Points
x <- rnorm(1000000, mean=0, sd=2)
y <- rnorm(1000000, mean=0, sd=2)
xy <- data.frame(x,y)
Same order as the number of pixels on my MacBook Air: 1400 x 900
Data
3. Data Sample
Sampling can be effective (with overweighting of unusual values)
Requires multiple plots or careful tuning of parameters
4. Data Sample Model
Models are great as they scale nicely.
But visualisation is still required, as “I don’t know what I don’t know.”
5. Data Sample Model Binning
Binning can solve a lot of these challenges
“Bin - Summarize - Smooth: A framework for visualising big data” - Hadley Wickham (2013)
8. Approach
BIN : fixed-width bins, bin index = floor((x - origin) / width)
SUMMARIZE : summary stats = count, mean, stdev
SMOOTH : smoothing e.g. kernel mean, regression
VISUALISE : visualise using standard plots
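The four steps above can be sketched in base R for a single variable. This is only an illustration under stated assumptions, not the bigvis implementation: the response `z`, the bin `width`, the bandwidth `h` and the helper `smooth_mean` are all hypothetical choices made for the example.

```r
# Minimal base-R sketch of bin -> summarise -> smooth (not the bigvis code).
set.seed(1)
x <- rnorm(100000, mean = 0, sd = 2)
z <- x^2 + rnorm(100000)            # hypothetical response variable

width  <- 0.5                       # assumed bin width
origin <- min(x)                    # assumed origin: smallest observed value

# BIN: integer bin index for every observation
bin_id <- floor((x - origin) / width)

# SUMMARIZE: count, mean and standard deviation of z within each bin
counts <- tapply(z, bin_id, length)
means  <- tapply(z, bin_id, mean)
sds    <- tapply(z, bin_id, sd)     # NA for bins holding a single point
ids    <- as.integer(names(counts))

binned <- data.frame(
  bin    = ids,
  centre = origin + (ids + 0.5) * width,
  count  = as.vector(counts),
  mean   = as.vector(means),
  sd     = as.vector(sds)
)

# SMOOTH: Gaussian kernel mean over bin centres, weighted by bin count
smooth_mean <- function(centre, value, weight, h) {
  vapply(centre, function(c0) {
    k <- dnorm((centre - c0) / h) * weight
    sum(k * value) / sum(k)
  }, numeric(1))
}
binned$smooth <- smooth_mean(binned$centre, binned$mean, binned$count, h = 1)

# VISUALISE: a standard plot of the (small) summary table
if (interactive()) {
  plot(binned$centre, binned$mean, xlab = "x (bin centre)", ylab = "mean of z")
  lines(binned$centre, binned$smooth, col = "orange")
}
```

The point of the design is that every step after binning works on the summary table, which has one row per bin rather than one row per observation.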
9. Bigvis Package in R
Aim: To plot 100 million points in under 5 seconds.
Approach:
- Plotting using standard R libraries
- Processing done in (fast) compiled C++ code, using the Rcpp package
- Outlier removal in big data
- Smoothing to highlight trends & suppress noise
12. Plotting the Condensed Data
p <- ggplot(BinData) + aes(carat, price, fill = .count) + geom_tile()
Binned into 20 bins and summarised by count
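The `BinData` object plotted above is not defined on this slide. One way a comparable table could be built in base R is sketched below; the helper `bin_2d_count` is hypothetical and is not how bigvis produces its condensed output, and the sketch assumes both variables are numeric and non-constant.

```r
# Hypothetical helper: count observations in an nbins x nbins grid.
# Returns one row per non-empty cell, with bin centres and a `.count` column.
bin_2d_count <- function(df, xvar, yvar, nbins = 20) {
  bin_centre <- function(v) {
    width <- diff(range(v)) / nbins
    # pmin() keeps the maximum value inside the last bin
    idx <- pmin(floor((v - min(v)) / width), nbins - 1)
    min(v) + (idx + 0.5) * width
  }
  binned <- data.frame(x = bin_centre(df[[xvar]]), y = bin_centre(df[[yvar]]))
  out <- aggregate(list(.count = rep(1L, nrow(binned))), by = binned, FUN = sum)
  names(out)[1:2] <- c(xvar, yvar)
  out
}

# Usage, assuming the diamonds data shipped with ggplot2 is the source data
if (requireNamespace("ggplot2", quietly = TRUE)) {
  BinData <- bin_2d_count(ggplot2::diamonds, "carat", "price", nbins = 20)
}
```

With 20 bins per axis the result has at most 400 rows, however many diamonds are in the input.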
13. Both Points & Condensed
q <- p + geom_point(data = diamonds, aes(fill = NULL), alpha = 0.2, colour = "orange")
Binned into 20 bins, summarised by count, with the raw data points overlaid
14. Movies dataset
ggplot(movies) + aes(length, rating) + geom_point(alpha = 0.2, colour = "orange")
About 130k observations of movies on IMDB, with variables such as length and rating
15. Let us see the outliers
  title                                            length rating
1 Matrjoschka                                        5700    8.5
2 The Cure for Insomnia                              5220    5.9
3 The Longest Most Meaningless Movie in the World    2880    7.3
4 The Hazards of Helen                               1428    6.6
5 ****                                               1100    6.9
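A listing like the one above could be produced by sorting on length and keeping the top rows. The helper `longest` below is a hypothetical name, and the sketch assumes a data frame with `title`, `length` and `rating` columns, as in the movies data plotted earlier.

```r
# Hypothetical helper: the n longest entries of a movies-like data frame.
longest <- function(df, n = 5) {
  ord <- order(df$length, decreasing = TRUE)
  head(df[ord, c("title", "length", "rating")], n)
}

# Usage, assuming `movies` is the data frame used in the plot above:
# longest(movies, n = 5)
```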
20. Big Data Visualisation
● Approach: Bin - Summarize - Smooth - Visualise
● “Interactively” plot nearly 100 million data points in-memory for EDA in R
● Can be extended to in-database processing, e.g. for binning
● Can be parallelised, e.g. summarising by count or mean
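The parallelisation point rests on counts and sums being additive: each chunk can be binned and summarised separately, and the partial results added together, with the mean recovered as total sum over total count. A minimal base-R sketch follows; the chunking scheme, `origin`, `width` and the helper `summarise_chunk` are assumptions made for illustration.

```r
# Sketch: per-chunk summaries combined into a global count and mean per bin.
set.seed(42)
x <- rnorm(100000, sd = 2)
z <- rnorm(100000, mean = x)        # hypothetical response variable

origin <- -12                       # assumed; must be identical for all chunks
width  <- 0.5

# Summarise one chunk: count and sum of z per bin
summarise_chunk <- function(x, z, origin, width) {
  bin <- floor((x - origin) / width)
  aggregate(data.frame(count = rep(1, length(z)), total = z),
            by = list(bin = bin), FUN = sum)
}

# Split the row indices into 4 chunks and summarise each one independently.
# lapply() stands in for a parallel map here.
chunks <- split(seq_along(x), rep(1:4, length.out = length(x)))
parts  <- lapply(chunks, function(i) summarise_chunk(x[i], z[i], origin, width))

# Combine: add the counts and sums per bin, then derive the mean
combined <- aggregate(cbind(count, total) ~ bin,
                      data = do.call(rbind, parts), FUN = sum)
combined$mean <- combined$total / combined$count
```

Summaries that are not additive, such as the median, cannot be combined this simply.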