Pilosa, as a technology, changes the dialogue around large data sets, both static and in motion. Historically, data lakes like Hadoop have been used to store massive amounts of data. However, it is estimated that only 20% of that data is practically analyzable, because complex ad-hoc analytical operations become computationally painful and slow.
Enter a distributed binary index: Pilosa. While it can be used to unlock and join massive datasets and streams, it can also be thought of as an accelerator for training machine learning models and, most importantly, for running your algorithms in large-scale production environments. In this workshop Hypergiant will discuss how Pilosa interacts with several ML ideas, including the Winnow algorithm, association schemes, and recommendation engines.
2. HYPERGIANT2019|CONFIDENTIAL
In this workshop we will discuss:
• Pilosa and what it does
• How this might impact recommendation engines
• How this might impact association schemes
• The geometry of data in a Pilosa index
• An experimental variant of the Winnow algorithms run on a Pilosa index
G O A L :
AN EXPERIMENT ON DATA SCIENCE ALGORITHMS ENABLED BY A PILOSA INDEX
4. EXPERIENCE + INTELLIGENCE
PEOPLE: A bleeding-edge dream team with a deep understanding of the rich panoply of advancements available through MI.
PROCESS: We blend strategy, design, and development capabilities to create experiences and define new capabilities leveraging Machine Intelligence.
METHOD: Our signature, tech-agnostic approach balances the utilitarian and the evolutionary.
5. OUR SERVICES
01: Strategy
02: Design
03: Applied Sciences
04: Development
We are composed of digital product strategists, data scientists, machine-learning-focused engineers, creative technologists, user experience designers & developers for all endpoints.
DIV. 0001: SPACE AGE SOLUTIONS / HYPERGIANT - 2019
Strategy + Design + Applied Science + Delivery > Technology-Agnostic Artificial Intelligence
6. H Y P E R G I A N T M e t h o d o l o g y
(Diagram: User Experience and Machine Intelligence combine to produce the Output.)
7. R E A L I Z E T H A T E V E R Y T H I N G C O N N E C T S T O E V E R Y T H I N G E L S E
USERS · BUSINESS · DATA
The traditional design model, wherein one weighs the user value and the business value of a given feature, is outdated. It has been replaced with a framework in which one weighs user value, business value, and data value. If choices do not respect the value of each, the result will leave at least one group unsatisfied.
8. W H A T W E B E L I E V E
H Y P E R G I A N T + P I L O S A
—
9. WHAT WE BELIEVE
E N T E R A D I S T R I B U T E D B I N A R Y I N D E X : P I L O S A .
• We see Pilosa as an important technology for a more extensible future.
• We see it as a potential solution for connecting the quagmire of enterprise datasets into the meaningful data puddles required to drive more fluid data science mechanics.
• We see it as a competitive advantage in dealing with the cost of real-time data access.
PILOSA CHANGES THE DIALOG AROUND LARGE DATA SETS, BOTH STATIC AND IN MOTION.
11. P I L O S A : C O N C E P T
W H A T I S P I L O S A ?
• “an open source, distributed bitmap index that dramatically accelerates continuous analysis across multiple, massive data sets.”
W H A T D O E S T H I S M E A N ?
• Data lakes are a problem, especially when we are trying to do the initial exploratory statistical analysis of a dataset; even finding values can be slow and tedious.
• Pilosa allows queries to be run over the entire dataset quickly.
• Example: 1.2B data points, 8 features, 0.07 seconds
12. P I L O S A : C O N C E P T
H O W D O E S I T D O T H I S ?
• Pilosa focuses on “relationships between objects and storing those relationships in bitmaps.”
• That is: it is data-feature focused.
W H A T D O E S T H I S M E A N F O R U S ?
• This allows a data scientist to search quickly for a combination of features over the entire data set: finding the data points with those features, counting them, etc.
13. P I L O S A : C O N C E P T
H O W D O W E T H I N K O F I T ?
• Pilosa is a bitmap index. At heart it is a boolean vector over the features of a data point.
W H A T C A N T H E I N D E X D O F O R U S ?
• Initial multiway explorations can be done quickly (if the data is already in an index)
• Combination features can be built and tested quickly
• Balanced data sets can be built quickly
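The bitmap idea behind these bullets can be sketched in a few lines. This is not the Pilosa API, just a toy illustration (with made-up feature names) of why feature combinations are cheap: each feature keeps a bitmap of which records have it, and a multi-feature query is a single bitwise AND.

```python
# Toy bitmap index: each feature maps to one Python int used as a bitmap,
# with bit i set when record i has that feature.

def bitmap(record_ids):
    """Pack a set of record ids into one integer bitmap."""
    bits = 0
    for rid in record_ids:
        bits |= 1 << rid
    return bits

# Hypothetical features over records 0..7 (names are illustrative only).
index = {
    "yellow_taxi": bitmap([0, 1, 2, 5, 6]),
    "night_trip":  bitmap([1, 2, 3, 6]),
    "tipped":      bitmap([2, 4, 6, 7]),
}

# "Which records have all three features?" is one AND across the bitmaps.
hits = index["yellow_taxi"] & index["night_trip"] & index["tipped"]
matching = [rid for rid in range(8) if hits >> rid & 1]
count = bin(hits).count("1")
print(matching, count)  # [2, 6] 2
```

Counting, intersecting, and unioning bitmaps like this is how the index answers combination-feature questions without scanning raw rows.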
14. P I L O S A : C O N C E P T
W H A T E L S E C O U L D W E D O W I T H I T ?
• The index can be treated as a dataset in itself
• It is a data set built over binary features
W H A T A L G O R I T H M S R U N O N T H I S ?
• Recommendation engines
• Association rule learning
• Winnow algorithms
• (Others)
15. R E C O M M E N D A T I O N E N G I N E S & A S S O C I A T I O N R U L E L E A R N I N G
16. R E C O M M E N D A T I O N E N G I N E S
D E E P B E L I E F N E T W O R K S
A Deep Belief Network (DBN) is made of layers of Restricted Boltzmann Machines (RBMs). An RBM has two parts, a hidden layer and a visible layer; data bounces back and forth between the visible and hidden layers, probabilistically approximated, and is then used to update the probability distributions.
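The "bounce" between layers can be sketched as one Gibbs step of an RBM. This is a minimal NumPy illustration, not a trained model: the layer sizes and random weights are assumptions for demonstration.

```python
# One RBM Gibbs step: sample hidden units given visible, then reconstruct
# visible units given hidden. Weights here are random, purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
W = rng.normal(0, 0.1, size=(n_visible, n_hidden))  # visible-hidden weights
b = np.zeros(n_visible)                             # visible biases
c = np.zeros(n_hidden)                              # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v):
    """Visible -> hidden sample -> visible reconstruction."""
    p_h = sigmoid(v @ W + c)                  # P(h=1 | v)
    h = (rng.random(n_hidden) < p_h) * 1.0    # sample binary hidden units
    p_v = sigmoid(h @ W.T + b)                # P(v=1 | h)
    v_recon = (rng.random(n_visible) < p_v) * 1.0
    return h, v_recon

v0 = np.array([1, 0, 1, 1, 0, 0], dtype=float)
h, v1 = gibbs_step(v0)
print(h.shape, v1.shape)  # (3,) (6,)
```

Training (e.g., contrastive divergence) repeats this step and nudges W toward making the reconstruction match the data; stacking trained RBMs gives the DBN.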
17. R E C O M M E N D A T I O N E N G I N E S
“A [recommendation engine] is a subclass of information filtering system[s] that seeks to predict the ‘rating’ or ‘preference’ a user would give to an item.” — Wikipedia
The key idea is that they do not need to be trained on complete data.
18. R E C O M M E N D A T I O N E N G I N E S
R E P R E S E N T A T I O N S !
• Two features can work together:
• Did the user watch the film? (Yes/No)
• Did the user give a positive review? (Yes/No)
• In this setting (No, __) represents an incomplete data point with no known value: [0, *]
• Similarly, a richer ranking can be used:
• Did the user watch the film? (Yes/No)
• Did the user give an n-star review? (Yes/No)
• In this setting (Yes, No, No, Yes, No, No) is a 3-star review for a watched movie: [1,0,0,1,0,0]
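The star-rating representation above is easy to make concrete: one "watched" bit followed by a one-hot slot per possible rating. This is a sketch under the assumption of a 1–5 star scale; the function name is made up for illustration.

```python
# Encode (watched, stars) as the binary vector described on the slide:
# bit 0 = watched, bits 1..max_stars = one-hot star rating.

def encode(watched, stars=None, max_stars=5):
    """Return [watched, star_1, ..., star_max] as 0/1 bits."""
    bits = [1 if watched else 0] + [0] * max_stars
    if watched and stars is not None:
        bits[stars] = 1  # stars are 1-indexed, so this lands after the watched bit
    return bits

print(encode(True, 3))  # [1, 0, 0, 1, 0, 0] -> watched, 3-star review
print(encode(False))    # [0, 0, 0, 0, 0, 0] -> incomplete data point [0, *]
```

Vectors in this form drop straight into a bitmap index: each bit position becomes one feature row.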
19. R E C O M M E N D A T I O N E N G I N E S
R E C O M M E N D A T I O N , M E E T R E A L I T Y :
Quality Assurance (QA) for recommendation engines, and machine learning in general, is hard. There is a general lack of QA tools for ML, and a lack of knowledge around what types of errors occur and what they look like.
Using a DBN recommendation engine we can build out a probability distribution for the population, based upon a set of features, and then query quickly across the population to see which predictions differ from the population proportion over the remaining features.
20. A S S O C I A T I O N R U L E L E A R N I N G
W H A T I S A S S O C I A T I O N R U L E L E A R N I N G ?
“Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.” — Wikipedia
21. A S S O C I A T I O N R U L E L E A R N I N G
H U H ?
Following the original definition, the problem of association rule mining can be defined as:
• I = {i1, i2, … , in}: a set of n binary attributes, called items in the literature; for us, features.
• D = {t1, t2, … , tm}: the database, or set of data points.
• A rule is given as “X implies Y”, where X and Y are sets of features.
22. A S S O C I A T I O N R U L E L E A R N I N G
U S E F U L I D E A S
As always, there are useful metrics:
• Supp(X): the proportion of the data which contains all of X.
• Conf(Y|X) = Supp(X and Y)/Supp(X): the proportion of the data containing X which also contains Y.
• Lift(X,Y) = Conf(Y|X)/Supp(Y): a measurement of independence between X and Y.
• Conviction(Y|X) = (1 − Supp(Y))/(1 − Conf(Y|X)): the ratio of the expected to the observed frequency of X occurring without Y.
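These four metrics are simple enough to compute directly on a small binary dataset. The feature names below are made up for illustration; the formulas are the standard ones from the slide.

```python
# Association-rule metrics on a toy dataset: each row is the set of
# features present in one data point (names are illustrative only).
data = [
    {"diapers", "beer", "evening"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "evening"},
    {"milk"},
]

def supp(X):
    """Proportion of data points containing every feature in X."""
    return sum(X <= t for t in data) / len(data)

def conf(X, Y):
    """Of the points containing X, the proportion also containing Y."""
    return supp(X | Y) / supp(X)

def lift(X, Y):
    """1.0 means X and Y look independent; >1 means positive association."""
    return conf(X, Y) / supp(Y)

def conviction(X, Y):
    c = conf(X, Y)
    return float("inf") if c == 1 else (1 - supp(Y)) / (1 - c)

X, Y = {"diapers"}, {"beer"}
print(round(supp(X), 3), round(conf(X, Y), 3),
      round(lift(X, Y), 3), round(conviction(X, Y), 3))
# 0.6 0.667 1.111 1.2
```

On a Pilosa-style index, each `supp` call is just a bitmap intersection plus a count, which is why these metrics become cheap to evaluate at scale.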
23. A S S O C I A T I O N R U L E L E A R N I N G
E X A M P L E !
The (long-storied) example of a learned association rule is the “beer and diapers” rule for shopping between 5:00pm and 7:00pm.
This rule could then be stated as “[5:00-7:00] & [diapers] implies [beer]”, or similar, depending upon the confidence.
24. A S S O C I A T I O N R U L E L E A R N I N G
D O W N S I D E S …
Association rule learning has downsides: the number of potential rules grows exponentially with the size of the feature set, and most of the definitions of ‘interesting’ rules require a large sampling over the dataset.
These problems can be reduced through the use of multiple queries over a Pilosa index related to feature pairs.
26. W I N N O W A L G O R I T H M
G E O M E T R Y !
What does the geometry of a discretized dataset in a Pilosa layer look like? There are discrete features, and discretized continuous features. These give it a geometry that looks like:
H^(n0) × S^(n1) × … × S^(nm)
(a hypercube factor for the binary features and a simplex factor for each discretized continuous feature)
27. W I N N O W A L G O R I T H M
W H A T D O E S T H I S M E A N ?
Hypercubes and simplices both behave well with respect to hyperplane separators. Note that this implies there is good reason to believe that a linear separator between two classes, or several one-vs-many linear separators, will behave well when treating the index itself as a dataset.
28. W I N N O W A L G O R I T H M
L I N E A R S E P A R A T O R S
There are many classification algorithms that find a linear separator between the classes:
• SVM with a linear kernel
• Perceptron
• Winnow
29. W I N N O W A L G O R I T H M
There are several versions of the Winnow algorithm, which differ mainly in how they treat the ‘other’ class.
They differ from perceptron algorithms in that they are generally updated multiplicatively rather than additively, and can only be used on binary data.
30. W I N N O W A L G O R I T H M
W I N N O W 1
• Define two classes {0,1}, initialize the weights (wi) to all ones, and set a threshold value θ (n/2 generally) and a learning rate r (2 generally).
• For each data point (x, y) do:
• Check if: ∑ᵢ wᵢxᵢ > θ (sum over i = 1..n)
• If true, and y=1, the prediction is correct
• If true, and y=0, then set wᵢ=0 for all xᵢ>0
• If false, and y=0, the prediction is correct
• If false, and y=1, then set wᵢ=r·wᵢ for all xᵢ>0
• Return the weights for the linear classifier.
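The steps above fit in a few lines of Python. This is a sketch of Winnow1 as described on the slide; the toy dataset (label equals feature 0, with noisy extra features) is an assumption for demonstration.

```python
# Winnow1: multiplicative promotion on false negatives, weights zeroed on
# false positives. x is a 0/1 feature vector, y the {0, 1} class label.

def winnow1(data, n, r=2.0):
    """data: iterable of (x, y) pairs; n: number of features."""
    w = [1.0] * n
    theta = n / 2.0
    for x, y in data:
        pred = sum(wi * xi for wi, xi in zip(w, x)) > theta
        if pred and y == 0:        # false positive: eliminate active weights
            w = [0.0 if xi > 0 else wi for wi, xi in zip(w, x)]
        elif not pred and y == 1:  # false negative: promote active weights
            w = [r * wi if xi > 0 else wi for wi, xi in zip(w, x)]
    return w

# Toy run: the label is exactly feature 0; features 1..3 are noise.
points = [
    ([1, 1, 0, 0], 1),
    ([0, 1, 1, 0], 0),
    ([1, 0, 1, 1], 1),
    ([0, 0, 0, 1], 0),
    ([1, 1, 1, 0], 1),
]
w = winnow1(points * 3, n=4)
print(w)  # the informative feature ends up with the largest weight
```

Note how noisy features get zeroed out along the way, which is the winnowing behavior discussed on the next slide.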
31. W I N N O W A L G O R I T H M
D R O P U N I M P O R T A N T V A R I A B L E S ?
What does setting a coefficient to zero do? Once a variable is set to zero, it cannot be changed! This allows the removal of ‘noisy’ features, or features that may indicate non-inclusion in the class.
This reduction of the space of variables ‘winnows’ the useful (positive) features from the rest of them.
Since all the variables are normalized, this means the algorithm does (in some sense) dimension reduction and variable importance, and produces a classifier.
32. W I N N O W A L G O R I T H M
W I N N O W 2
• Define two classes {0,1}, initialize the weights (wi) to all ones, and set a threshold value θ (n/2 generally) and a learning rate r (2 generally).
• For each data point (x, y) do:
• Check if: ∑ᵢ wᵢxᵢ > θ (sum over i = 1..n)
• If true, and y=1, the prediction is correct
• If true, and y=0, then set wᵢ=wᵢ/r for all xᵢ>0
• If false, and y=0, the prediction is correct
• If false, and y=1, then set wᵢ=r·wᵢ for all xᵢ>0
• Return the weights for the linear classifier.
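The only change from Winnow1 is the demotion step: a false positive divides the active weights by r instead of zeroing them, so a demoted feature can recover later. A sketch of just that update rule (function name is illustrative):

```python
# One Winnow2 update for a single point (x, y): divide-by-r demotion,
# multiply-by-r promotion, no change on a correct prediction.

def winnow2_update(w, x, y, theta, r=2.0):
    pred = sum(wi * xi for wi, xi in zip(w, x)) > theta
    if pred and y == 0:        # demote: divide, don't zero
        return [wi / r if xi > 0 else wi for wi, xi in zip(w, x)]
    if not pred and y == 1:    # promote
        return [r * wi if xi > 0 else wi for wi, xi in zip(w, x)]
    return w

print(winnow2_update([1.0, 1.0, 1.0], [1, 1, 1], 0, theta=1.5))  # [0.5, 0.5, 0.5]
```

This gentler demotion trades Winnow1's hard feature elimination for robustness to occasional mislabeled points.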
33. W I N N O W A L G O R I T H M
D E M O D A T A !
The Pilosa demo database:
• Contains information related to taxi cabs in New York City,
• Over 1.2 billion entries,
• Has several thousand features (I did not play with all of them),
• Many discretized continuous variables,
• Has two types of taxi: green (0) and yellow (1), with only 45 million green taxi data points in the entire set.
34. W I N N O W A L G O R I T H M
S T R A T E G I E S !
Two general approaches, and two Winnow algorithms:
• Choose a set of features (independently), find the sub-population, and choose a sample from it.
• Choose a set of features (independently), find the sub-population, and assign it to be 0 or 1 based upon whether there are more (weighted) 0s or 1s in it.
35. W I N N O W A L G O R I T H M
P O S T - F A C T O O B S E R V A T I O N S
From playing with the Pilosa queries and the results of the algorithms, we learned that the dataset is very sparse in terms of combinations of features. This, together with the roughly 27× imbalance between yellow and green taxis, leads to a fairly rigid separation of the green from the yellow taxis, as the yellow seem more distributed.
36. W I N N O W A L G O R I T H M
T H R E S H O L D
The literature suggests that a threshold value of half the number of features produces good values and convergence. Experimentation with smaller samples (and the known geometry) suggests that a smaller threshold would have faster initial convergence.
37. W I N N O W A L G O R I T H M
T I M E B E N C H M A R K S
Experimentation suggests that finding a set of features with a non-empty sub-population is the biggest difficulty in these approaches. Running the algorithms on 1000 subpopulations took over 35 minutes, with most of that time taken up by the many queries over the features.
41. W I N N O W A L G O R I T H M
A C A V E A T A B O U T T I M E …
This was run on a small virtual machine I was given access to by Pilosa; it did not take advantage of many of the cloud computing resources available.
In particular, Pilosa does have a TensorFlow interface, which would have dramatically improved the computations.
That being said, the difference between running the algorithm and just finding the features was a few seconds.
42. W I N N O W A L G O R I T H M
F U T U R E S T E P S
1. See how much a TensorFlow implementation would speed up computation
2. Experiment with alternatives to Winnow1 and Winnow2
3. Design a better feature sampling method than uniform over each feature
4. Run the experiment on a different dataset to check performance