Oscon data-2011-ted-dunning

Preliminaries
• Code is available from GitHub:
– git@github.com:tdunning/Chapter-16.git

• EC2 instances available
• Thumb drives also available
• Email to ted.dunning@gmail.com
• Twitter @ted_dunning

A Quick Review
• What is classification?
– goes-ins: predictors
– goes-outs: target variable
• What is classifiable data?
– continuous, categorical, word-like, text-like
– uniform schema
• How do we convert from classifiable data to a feature vector?

Data Flow

Not quite so simple

Classifiable Data
• Continuous
– A number that represents a quantity, not an id
– Blood pressure, stock price, latitude, mass
• Categorical
– One of a known, small set (color, shape)
• Word-like
– One of a possibly unknown, possibly large set
• Text-like
– Many word-like things, usually unordered

But that isn’t quite there
• Learning algorithms need feature vectors
– Have to convert from data to vector
• Can assign one location per feature
– or category
– or word
• Can assign one or more locations with hashing
– scary
– but safe on average
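The conversion from classifiable data to a feature vector with hashing can be sketched as follows. This is a minimal illustration and not the code from the talk's repository: the helper names `hashed_locations` and `encode`, the vector size, the number of probes, and the rule that a string containing a space counts as text-like are all assumptions made for the example.

```python
import zlib

def hashed_locations(name, size, probes=2):
    """Hypothetical helper: pick `probes` vector locations for a feature name."""
    return [zlib.crc32(f"{name}#{i}".encode("utf-8")) % size for i in range(probes)]

def encode(record, size=32, probes=2):
    """Encode a dict of fields into a fixed-size feature vector by hashing."""
    vector = [0.0] * size
    for field, value in record.items():
        if isinstance(value, (int, float)):
            # Continuous: the field name picks the locations, the value is the weight.
            for loc in hashed_locations(field, size, probes):
                vector[loc] += float(value)
        elif isinstance(value, str) and " " in value:
            # Text-like (assumed here: any string with a space): many word-like
            # things, order ignored.
            for word in value.lower().split():
                for loc in hashed_locations(f"{field}={word}", size, probes):
                    vector[loc] += 1.0
        else:
            # Categorical or word-like: field=value picks the locations.
            for loc in hashed_locations(f"{field}={value}", size, probes):
                vector[loc] += 1.0
    return vector

record = {"mass": 2.5, "color": "red", "body": "the quick brown fox"}
print(encode(record))
```

Different features may collide on the same location, which is the scary part. Using more than one probe per feature spreads each feature over several locations, so a single collision rarely hides a feature completely.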

Let’s write some code

(cue relaxing background music)

Generating new features
• Sometimes the existing features are difficult to use
• Restating the geometry using new reference points may help
• Automatic reference points using k-means can be better than manual references
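One way to read "restating the geometry" is to replace each point with its distances to a few reference points, and to let k-means choose those reference points. The sketch below assumes that reading. The function names are hypothetical, and the initialization (the first k points) is a deliberate simplification rather than anything the talk prescribes.

```python
import math

def distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=10):
    """Plain k-means; starts from the first k points (a naive, assumed choice)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: distance(p, centroids[i]))
            groups[nearest].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster goes empty
                centroids[i] = [sum(c) / len(group) for c in zip(*group)]
    return centroids

def reference_features(point, centroids):
    """New features: the distance from the point to each reference point."""
    return [distance(point, c) for c in centroids]

points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centroids = kmeans(points, k=2)
print(reference_features((0, 0), centroids))
```

The new feature vector has one entry per centroid, so a learner that struggles with the raw coordinates can work with "how far is this point from each cluster" instead.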

More code!

(cue relaxing background music)

Integration Issues
• Feature extraction is ideal for map-reduce
– Side data adds some complexity
• Clustering works great with map-reduce
– Cluster centroids to HDFS

• Model training works better sequentially
– Need centroids in normal files
• Model deployment shouldn’t depend on HDFS

Parallel Stochastic Gradient Descent
[Diagram: Input → Train sub model → Average models → Model]
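The train-then-average flow for parallel stochastic gradient descent can be sketched with a small logistic regression. This is an illustration under stated assumptions: the shards are trained one after another here rather than in parallel, the learning rate and pass count are arbitrary, and none of the function names come from the talk.

```python
import math

def predict(w, x):
    """Logistic model: probability that the target is 1."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

def sgd_logistic(examples, dims, rate=0.1, passes=20):
    """Sequential SGD over (features, label) pairs; this is one sub model."""
    w = [0.0] * dims
    for _ in range(passes):
        for x, y in examples:
            error = y - predict(w, x)
            for i in range(dims):
                w[i] += rate * error * x[i]
    return w

def average_models(models):
    """Average the weight vectors of the sub models, coordinate by coordinate."""
    return [sum(ws) / len(models) for ws in zip(*models)]

def train_parallel(examples, dims, shards=2):
    """Split the input, train one sub model per shard, then average them."""
    parts = [examples[i::shards] for i in range(shards)]
    sub_models = [sgd_logistic(part, dims) for part in parts]  # parallel in practice
    return average_models(sub_models)

# Each example is ([bias, value], label); the label is 1 when value is positive.
examples = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, 2.0], 1), ([1.0, 1.0], 1),
            ([1.0, -1.5], 0), ([1.0, -0.5], 0), ([1.0, 1.5], 1), ([1.0, 0.5], 1)]
model = train_parallel(examples, dims=2)
print(model, predict(model, [1.0, 2.0]))
```

Averaging only needs the final weight vectors, so each shard can train without talking to the others until the very end.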

Variational Dirichlet Assignment
[Diagram: Input → Gather sufficient statistics → Update model → Model]

Old tricks, new dogs
• Mapper
– Assign point to cluster
– Emit cluster id, (1, point)
• Combiner and reducer
– Sum counts, weighted sum of points
– Emit cluster id, (n, sum/n)
• Output to HDFS
[Diagram callouts: Read from local disk from distributed cache; Read from HDFS to local disk by distributed cache; Written by map-reduce]
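The mapper, combiner and reducer steps for one k-means iteration can be sketched in plain Python, without Hadoop. The function names and the in-memory "batches" that stand in for mapper inputs are assumptions for illustration. Because the combiner emits (n, sum/n), the same function can serve as the reducer: it re-weights each incoming mean by its count before summing.

```python
def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def mapper(point, centroids):
    """Assign the point to a cluster and emit cluster id, (1, point)."""
    return nearest(point, centroids), (1, list(point))

def combine(pairs):
    """Combiner and reducer: sum counts, take the weighted sum of points,
    and emit cluster id, (n, sum/n)."""
    sums = {}
    for cid, (n, mean) in pairs:
        count, total = sums.get(cid, (0, [0.0] * len(mean)))
        sums[cid] = (count + n, [t + n * m for t, m in zip(total, mean)])
    return [(cid, (n, [t / n for t in total])) for cid, (n, total) in sums.items()]

centroids = [[0.0, 0.0], [10.0, 10.0]]
batches = [[(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)],
           [(0.0, 2.0), (12.0, 10.0), (2.0, 2.0)]]

# One combiner call per mapper batch, then a single reducer call over all outputs.
combined = [pair for batch in batches
            for pair in combine(mapper(p, centroids) for p in batch)]
new_centroids = dict(combine(combined))
print(new_centroids)
```

The reducer output holds the count and the new centroid for each cluster id, which is what the next iteration's mappers would read as side data.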

Old tricks, new dogs
• Mapper
– Assign point to cluster
– Emit cluster id, 1, point
• Combiner and reducer
– Sum counts, weighted sum of points
– Emit cluster id, n, sum/n
• Output to HDFS
[Diagram callouts: Read from NFS; Written by map-reduce; MapR FS]

Modeling architecture
[Diagram: Input → Feature extraction and down sampling → Data join → Sequential SGD Learning]
[Diagram labels: Side-data (now via NFS); Map-reduce]
