Visualising Big Data

Amit Kapoor
@amitkaps

Visualise a Million
Data Points
x <- rnorm(1000000, mean=0, sd=2)
y <- rnorm(1000000, mean=0, sd=2)
xy <- data.frame(x,y)
Same order of magnitude as the
number of pixels
on my MacBook Air
(1440 x 900)
Data

Data Sample
Sampling can be
effective (with
overweighting of
unusual values),
but it requires multiple
plots or careful
tuning of parameters
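
A minimal sketch of the two sampling ideas above, in base R. The data size, the sample size, and the weighting rule (weight proportional to distance from the origin) are illustrative assumptions, not something the slides specify.

```r
# Sketch only: uniform vs. weighted sampling of simulated points (base R).
set.seed(42)                       # arbitrary seed, for reproducibility
n  <- 1e5                          # smaller than the deck's 1e6 to keep this quick
xy <- data.frame(x = rnorm(n, mean = 0, sd = 2),
                 y = rnorm(n, mean = 0, sd = 2))

n_sample <- 1000                   # hypothetical sample size

# Uniform sample: every row is equally likely to be kept
idx_u <- sample.int(nrow(xy), size = n_sample)
xy_u  <- xy[idx_u, ]

# Overweight unusual values: here "unusual" is assumed to mean
# "far from the origin", so the weight grows with the radius
radius <- sqrt(xy$x^2 + xy$y^2)
idx_w  <- sample.int(nrow(xy), size = n_sample, prob = radius + 1e-6)
xy_w   <- xy[idx_w, ]
```

Plotting `xy_u` and `xy_w` side by side shows why several plots are usually needed: the uniform sample shows the bulk, the weighted sample shows the fringe, and neither shows both faithfully.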

Data Sample
Model
Models are great as
they scale nicely.
But visualisation is
still required, as
“I don’t know what I
don’t know.”

Data Sample
Model Binning
Binning can solve a
lot of these
challenges
“Bin-summarise-smooth:
a framework for
visualizing large data” -
Hadley Wickham (2013)

“Visualising big data
is the process of creating
generalized histograms”

Approach
BIN : fixed-width bins, bin index = floor((x - origin) / width)
SUMMARIZE : summary statistics per bin, e.g. count, mean, standard deviation
SMOOTH : smooth the bin summaries, e.g. kernel mean, regression
VISUALISE : visualise using standard plots
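
The four steps above can be sketched in base R without any extra packages. This is an illustration of the idea under stated assumptions, not the bigvis implementation: the simulated response `z`, the bin width, and the bandwidth `h` are all made up for the example.

```r
# Sketch only: bin - summarise - smooth - visualise in one dimension (base R).
set.seed(1)
x <- rnorm(1e5, mean = 0, sd = 2)
z <- sin(x) + rnorm(length(x), sd = 0.5)    # hypothetical response variable

# BIN: fixed-width bins, index = floor((x - origin) / width)
origin <- floor(min(x))
width  <- 0.1
bin_id <- floor((x - origin) / width)

# SUMMARISE: count and mean of z within each occupied bin
count  <- as.vector(tapply(z, bin_id, length))
mean_z <- as.vector(tapply(z, bin_id, mean))
centre <- origin + (sort(unique(bin_id)) + 0.5) * width

# SMOOTH: kernel mean over bin centres, weighted by bin count
h <- 0.3                                    # assumed bandwidth
smooth_z <- sapply(centre, function(c0) {
  k <- dnorm((centre - c0) / h) * count
  sum(k * mean_z) / sum(k)
})

# VISUALISE: a standard plot of the summaries, not of the raw points
plot(centre, mean_z, pch = 16, cex = 0.5, col = "grey",
     xlab = "x (bin centre)", ylab = "mean of z")
lines(centre, smooth_z, col = "orange", lwd = 2)
```

The plot draws a couple of hundred bin summaries instead of 100,000 points, which is what lets the approach scale.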

Bigvis Package in R
Aim: To plot 100 million points in under 5 seconds.
Approach:
- Plotting using standard R libraries
- Processing done in (fast) compiled C++ code, using
the Rcpp package
- Outlier removal in big data
- Smoothing to highlight trends & suppress noise

Diamonds dataset
ggplot(diamonds) + aes(carat, price) +
  geom_point(alpha = 0.2, colour = "orange")
~50k observations of diamonds, e.g. price and carat

Condense (bin + summarise)
library(bigvis)
library(ggplot2)
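Nbin <- 20  # target number of bins per axis
# Bin carat and price, then count the observations in each 2D bin
BinData <- with(diamonds, condense(
  bin(carat, find_width(carat, Nbin)),
  bin(price, find_width(price, Nbin))))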

Plotting the Condense
p <- ggplot(BinData) + aes(carat,
price, fill=.count) + geom_tile()
20 bins per axis, summarised by count

Both Points & Condensed
q <- p + geom_point(data = diamonds,
aes(fill = NULL), alpha = 0.2, colour
= "orange")
20 bins per axis, summarised by count, with the raw points overlaid

Movies dataset
ggplot(movies) + aes(length, rating) +
  geom_point(alpha = 0.2, colour = "orange")
130k observations of movies on IMDB, e.g. length and rating

Let us see the outliers
   title                                            length  rating
1  Matrjoschka                                        5700     8.5
2  The Cure for Insomnia                              5220     5.9
3  The Longest Most Meaningless Movie in the World    2880     7.3
4  The Hazards of Helen                               1428     6.6
5  ****                                               1100     6.9

Condense (bin + summarise)
library(bigvis)
library(ggplot2)
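Nbin <- 1e4  # target number of bins per axis
# Bin length and rating, then count the observations in each 2D bin
BinData <- with(movies, condense(
  bin(length, find_width(length, Nbin)),
  bin(rating, find_width(rating, Nbin))))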

Condensed Plot
p <- ggplot(BinData) + aes(length,
rating, fill=.count) + geom_tile()
10,000 bins per axis, summarised by count

Remove Outliers
p %+% peel(BinData)
10,000 bins per axis, summarised by count, with the 1% outliers peeled off
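
To make the idea of peeling concrete, here is a one-dimensional base R sketch that keeps the densest bins until they cover 99% of the observations. It is an approximation of the idea only, not the algorithm `peel()` uses, and the simulated data, bin width, and variable names are assumptions.

```r
# Sketch only: drop the lowest-density bins so that 99% of the data remain.
set.seed(7)
x <- c(rnorm(1e5), runif(50, min = 10, max = 100))  # bulk plus a few far-away values

width  <- 0.5                       # assumed bin width, origin at 0
bin_id <- floor(x / width)

tab <- table(bin_id)
ids <- as.numeric(names(tab))       # bin indices, in increasing order
cnt <- as.vector(tab)               # observations per bin

# Rank bins from densest to sparsest and accumulate their share of the data
keep   <- 0.99
ord    <- order(cnt, decreasing = TRUE)
share  <- cumsum(cnt[ord]) / sum(cnt)
n_keep <- which(share >= keep)[1]   # fewest dense bins that reach the target

kept_ids <- ids[ord[seq_len(n_keep)]]
peeled   <- x[bin_id %in% kept_ids]

range(x)        # stretched by the far-away values
range(peeled)   # back to the bulk of the data
```

Because the work is done on bin counts rather than raw values, the cost depends on the number of bins, not on the number of observations.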

Smoothing
smoothBinData <- smooth(peel(BinData), h = c(20, 1))
autoplot(smoothBinData)
10,000 bins per axis, summarised by count, 1% outliers peeled, then smoothed with h = c(20, 1)

Big Data Visualisation
● Approach: Bin - Summarize - Smooth - Visualise
● “Interactively” plot nearly 100 million data points in-memory for EDA in R
● Can be extended to in-database processing, e.g. for binning
● Can be parallelised, e.g. summaries such as count and mean
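
The last two bullets rest on bin counts being additive: counts computed on separate chunks of the data, against the same fixed grid, sum to the counts for the whole data set. A minimal base R sketch follows, with the grid, the chunking, and the helper name `bin_counts` all assumed for illustration.

```r
# Sketch only: per-chunk bin counts add up to the full-data bin counts.
set.seed(3)
x <- rnorm(2e5, mean = 0, sd = 2)

# One fixed grid shared by every chunk: 96 bins covering [-12, 12)
origin <- -12
width  <- 0.25
n_bins <- 96

# Hypothetical helper: count observations per bin on the shared grid
bin_counts <- function(v) {
  id <- floor((v - origin) / width) + 1
  id <- id[id >= 1 & id <= n_bins]      # ignore values outside the grid
  tabulate(id, nbins = n_bins)
}

# Split into 4 chunks; each lapply call could run on a separate worker,
# or be replaced by a per-partition GROUP BY inside a database
chunks  <- split(x, rep(1:4, length.out = length(x)))
partial <- lapply(chunks, bin_counts)
total   <- Reduce(`+`, partial)

stopifnot(identical(total, bin_counts(x)))
```

Means combine the same way if each chunk returns a per-bin sum alongside its count, since the overall mean is the total sum divided by the total count.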

Amit Kapoor
@amitkaps
amitkaps.com
narrativeviz.com
Data
Visual
Story