Joining the Club: Using Spark to Accelerate Big Data at Dollar Shave Club

Joining The Club
Accelerating Big Data with Apache Spark
Dollar Shave Club

Outline
Background on DSC
Engineering at DSC
Growth of Data Team
Show & Tell: Machine Learning Pipeline

Introduction
A David and Goliath Story

Engineering at DSC
- Frontend
  - Ember.js web apps
  - iOS and Android apps
  - HTML email
- Backend
  - Ruby on Rails web backends
  - Internal services (Ruby, Node.js, Golang, Python, Elixir)
  - Data and analytics (Python, SQL, Spark)
- QA
  - CircleCI, SauceLabs, Jenkins
  - TestUnit, Selenium
- IT
  - Office and warehouse IT

Engineering at DSC
highscalability.com

Data Engineering at DSC
A David and Big Data Story

Big Data
What is the barrier to entry?

Big Data
- Requires a different set of capabilities

Big Data
- Investing resources without an obvious ROI
- Knowing where to start

Data Engineering
- Machine learning pipeline
  - Models served in production
- Exploratory analysis
  - Customer segmentation (clustering)
  - Hypothesis testing
  - Data mining
  - NLP (topic modeling)

Data Engineering
- Maxwell + Kafka + Spark Streaming
  - Streaming data replication
  - Streaming metrics directly from the data layer
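
As a rough illustration of a metric computed directly from the replication stream, the sketch below counts row changes per table and change type. It assumes Maxwell-style JSON change events (fields `database`, `table`, `type`, `ts`, `data`); the sample events and the `count_changes` helper are hypothetical, and in production this logic would presumably run inside a Spark Streaming job reading from Kafka rather than over a Python list.

```python
import json
from collections import Counter

# Hypothetical sample of Maxwell-style change events, one JSON document per message.
SAMPLE_EVENTS = [
    '{"database": "shop", "table": "orders", "type": "insert", "ts": 1449786310, "data": {"id": 1}}',
    '{"database": "shop", "table": "orders", "type": "insert", "ts": 1449786311, "data": {"id": 2}}',
    '{"database": "shop", "table": "orders", "type": "update", "ts": 1449786312, "data": {"id": 1}}',
    '{"database": "shop", "table": "customers", "type": "insert", "ts": 1449786313, "data": {"id": 9}}',
]


def count_changes(raw_events):
    """Count row changes per (table, change type)."""
    counts = Counter()
    for raw in raw_events:
        event = json.loads(raw)
        counts[(event["table"], event["type"])] += 1
    return counts


metrics = count_changes(SAMPLE_EVENTS)
```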

Anatomy of a Machine Learning Pipeline

Box Manager Email
Problem: Order the product tiles in “Box Manager Email” to maximize profit
Constraints:
- Every customer sees some ordered set of products
- Do not show products already added to box
Result: +25% revenue per email open

Strategy
For each product, model the behavior which best distinguishes someone who
buys that product from someone who buys other products; rank a product by
the strength of the indicative behavior, when present, and rank a product
randomly otherwise
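
The ranking rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `rank_products` and its arguments are hypothetical names, `scores` holds a model score only for products whose indicative behavior was observed, and products without a score are placed after the scored ones in random order (the strategy statement does not say where random ranks fall relative to scored ones).

```python
import random


def rank_products(products, scores, in_box, rng=random):
    """Order product tiles for one customer.

    products: all candidate product ids
    scores:   dict of product -> model score, present only when the
              indicative behavior was observed for this customer
    in_box:   products already in the customer's box (never shown)
    """
    candidates = [p for p in products if p not in in_box]
    # Products with an observed signal are ranked by signal strength.
    scored = sorted(
        (p for p in candidates if p in scores),
        key=lambda p: scores[p],
        reverse=True,
    )
    # Everything else gets a random rank (assumed here to follow the scored ones).
    unscored = [p for p in candidates if p not in scores]
    rng.shuffle(unscored)
    return scored + unscored


products = ["razor", "butter", "wipes", "balm", "cream"]
scores = {"wipes": 0.4, "butter": 0.9}
order = rank_products(products, scores, in_box={"razor"}, rng=random.Random(7))
```

Every customer still receives a complete ordering of the products not already in their box, which satisfies both constraints of the problem.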

Model
- Logistic Regression
  - Learns the “tipping point” between success and failure
  - Success = “buys product X”
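
To make the “tipping point” concrete, here is a minimal sketch of how a fitted logistic regression turns features into a probability of success. The weights, bias, and feature values are made-up numbers for illustration, not values from the DSC model.

```python
import math


def predict_proba(features, weights, bias):
    """Probability of success under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative one-feature model: the tipping point is where z == 0,
# i.e. x == -bias / weight == 2.0, where the probability is exactly 0.5.
weights, bias = [2.0], -4.0
at_tipping_point = predict_proba([2.0], weights, bias)
above = predict_proba([3.0], weights, bias)
below = predict_proba([1.0], weights, bias)
```

Feature values above the tipping point push the probability of “buys product X” past 0.5, and values below it push the probability under 0.5.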

Design
- Extract data from data warehouse (Redshift)
- Join that data with hand-curated metadata (knowledge base)
- Aggregate and pivot events by customer and discretized time
- Generate a training set of feature vectors
- Select features to include in the final model
- Train and productionize the final model

Extract

def performExtraction(
    extractorClass, exportName, join_table=None, join_key_col=None,
    start_col=None, include_start_col=True, event_start_date=None
):
    # Column names are defined on the extractor class itself.
    customer_id_col = extractorClass.customer_id_col
    timestamp_col = extractorClass.timestamp_col
    # Pack the settings into keyword arguments for the extractor.
    extr_args = extractorArgs(
        customer_id_col, timestamp_col, join_table, join_key_col,
        start_col, include_start_col, event_start_date
    )
    extractor = extractorClass(**extr_args)
    export_path = redshiftExportPath(exportName)
    return extractor.exportFromRedshift(export_path)  # writes to Parquet

Extract

from pyspark import StorageLevel

def exportFromRedshift(self, path):
    # Stage the Redshift export as Parquet, then read it back and cache it.
    export = self.exportDataFrame()
    writeParquetWithRetry(export, path)
    return (
        sqlContext.read.parquet(path)
        .persist(StorageLevel.MEMORY_AND_DISK)
    )

def exportDataFrame(self):
    # Run the generated query through the spark-redshift connector.
    query = self.generateQuery()
    return (
        sqlContext.read
        .format("com.databricks.spark.redshift")
        .option("url", urlOption)
        .option("query", query)
        .option("tempdir", tempdir)
        .load()
    )

Domain Knowledge is Critical
The way that an expert organizes and represents facts in their domain.
- Guides feature extraction
- Prevents overfitting
- Vastly superior to unsupervised feature extraction (e.g., PCA)

Aggregate (Shard, Compress, Join) and Pivot!
This dance is hard to choreograph
- 8,736 columns
- 2.6 million rows
The DataFrame API is not optimized for extremely wide datasets

def generateQuery(self):
    # Group events by customer and by the bucketing, timestamp,
    # and start-date expressions.
    return """
        {0}
        FROM {1}
        GROUP BY customer_id, {2}, {3}, {4}
    """.format(
        self.selectClause(), self._tempTableName,
        self.bucketingExpr(), self.timestampCol, self.startDateExpr
    )

def perform(self):
    # Expose the preprocessed data to Spark SQL, then run the aggregation.
    self.preprocessedDataFrame().registerTempTable(self._tempTableName)
    return sqlContext.sql(self.generateQuery())

(0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0)
( 18, (6,16), (1,2) )
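
The two tuples above are the same vector in dense and compressed sparse form: (size, indices of non-zero entries, their values). A minimal pure-Python sketch of the conversion follows; the helper names `to_sparse` and `to_dense` are illustrative and not part of the DSC codebase.

```python
def to_sparse(dense):
    """Compress a dense vector into (size, indices, values)."""
    indices = tuple(i for i, v in enumerate(dense) if v != 0)
    values = tuple(dense[i] for i in indices)
    return (len(dense), indices, values)


def to_dense(sparse):
    """Expand (size, indices, values) back into a dense tuple."""
    size, indices, values = sparse
    dense = [0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return tuple(dense)


dense = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0)
sparse = to_sparse(dense)  # (18, (6, 16), (1, 2))
```

With thousands of mostly-zero columns per row, storing only the non-zero positions keeps each row small enough to shuffle and join cheaply.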

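def perform(self):
    # DataFrame.map was removed from PySpark in Spark 2.0;
    # going through .rdd works in both 1.x and 2.x.
    keyedMonthlyEvents = self.dataFrame.rdd.map(self.keyRow())
    pivotRDD = (
        keyedMonthlyEvents
        .combineByKey(
            self.initPivot(),     # createCombiner
            self.pivotEvent(),    # mergeValue
            self.combineDicts()   # mergeCombiners
        )
        .map(self.convertToRow())
        .persist(StorageLevel.MEMORY_AND_DISK)
    )
    return sqlContext.createDataFrame(pivotRDD, self.pivotedSchema())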
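
The three functions passed to `combineByKey` in `perform` are not shown on the slides. The sketch below is one guess at what they might do, with a local stand-in for `RDD.combineByKey` so it runs without Spark; `init_pivot`, `pivot_event`, `combine_dicts`, the key layout, and the sample data are all assumptions made for illustration.

```python
def init_pivot(event):
    """createCombiner: start a pivot dict from a key's first event."""
    column, count = event
    return {column: count}


def pivot_event(pivot, event):
    """mergeValue: fold one more event into an existing pivot dict."""
    column, count = event
    pivot[column] = pivot.get(column, 0) + count
    return pivot


def combine_dicts(left, right):
    """mergeCombiners: merge pivot dicts built on different partitions."""
    for column, count in right.items():
        left[column] = left.get(column, 0) + count
    return left


def combine_by_key(partitions, create, merge_value, merge_combiners):
    """Local stand-in for RDD.combineByKey over a list of partitions."""
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = merge_value(acc[key], value) if key in acc else create(value)
        partials.append(acc)
    result = {}
    for acc in partials:
        for key, comb in acc.items():
            result[key] = merge_combiners(result[key], comb) if key in result else comb
    return result


# key = (customer_id, month); value = (event type, count)
partitions = [
    [((1, "2016-01"), ("order", 1)), ((1, "2016-01"), ("login", 3))],
    [((1, "2016-01"), ("order", 2)), ((2, "2016-01"), ("login", 1))],
]
pivoted = combine_by_key(partitions, init_pivot, pivot_event, combine_dicts)
```

Each key ends up with a single dict of column to value, which is a sparse row: only the events a customer actually generated in that month are stored.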

Aggregate (Compress, Shard, Join) and Pivot!

Featurize
- "Explode" each customer's history into several "windows" of time.
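
One way to picture the windowing step is a sliding window over each customer's chronological history. The sketch below assumes fixed-length windows of discretized time buckets, with the bucket that follows each window held out as the place where a prediction target would be computed; the slides do not specify the window scheme, so `explode_windows` and its parameters are hypothetical.

```python
def explode_windows(buckets, history_len):
    """Return (history, next_bucket) pairs from a chronological list.

    buckets: one entry per discretized time bucket, oldest first
    """
    windows = []
    for end in range(history_len, len(buckets)):
        windows.append((buckets[end - history_len:end], buckets[end]))
    return windows


history = [{"orders": 1}, {"orders": 0}, {"orders": 2}, {"orders": 1}]
windows = explode_windows(history, history_len=2)
```

A single customer with four months of history yields two training examples here, so windowing multiplies the number of rows available for training.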

Featurize
- Define one or more prediction targets

Featurize
- Standardize each historical feature
- Persist on S3 as text files of compressed sparse vectors
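
Standardization is commonly done as a z-score per feature column. The following is a minimal sketch of that idea in plain Python, not the pipeline's actual implementation; `standardize_columns` is a hypothetical helper, and constant columns are mapped to 0.0 by assumption.

```python
import statistics


def standardize_columns(rows):
    """Z-score each column: subtract its mean, divide by its std deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.pstdev(c) for c in columns]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stdevs)]
        for row in rows
    ]


rows = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
standardized = standardize_columns(rows)
```

Note that centering turns zeros into non-zeros, so a pipeline that persists sparse vectors might scale by the standard deviation only; the slides do not say which variant was used.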

Select Features
1. Randomly select a set of new features to test
2. Derive training set for new features + previously selected features
3. Train model
4. Calculate the p-value for each feature
5. Retain significant features
6. Repeat
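
The selection loop can be sketched as follows. Model training is abstracted behind a caller-supplied `fit_pvalues` function, since the slides do not show how p-values were computed; `select_features`, its parameters, the 0.05 threshold, and the choice not to re-test a feature once it has been tried are all assumptions made for illustration.

```python
import random


def select_features(candidates, fit_pvalues, batch_size, rounds, alpha=0.05, seed=0):
    """Randomized stepwise feature selection.

    fit_pvalues(features) must train a model on `features` and return
    a dict of feature -> p-value; it is supplied by the caller.
    """
    rng = random.Random(seed)
    selected = []
    remaining = list(candidates)
    for _ in range(rounds):                                             # step 6
        if not remaining:
            break
        batch = rng.sample(remaining, min(batch_size, len(remaining)))  # step 1
        trial = selected + batch                                        # step 2
        pvalues = fit_pvalues(trial)                                    # steps 3-4
        selected = [f for f in trial if pvalues[f] < alpha]             # step 5
        remaining = [f for f in remaining if f not in batch]
    return selected


# Hypothetical stand-in for model training: pretend only these features matter.
TRULY_USEFUL = {"orders_last_month", "viewed_butter"}


def fake_fit_pvalues(features):
    return {f: (0.001 if f in TRULY_USEFUL else 0.6) for f in features}


chosen = select_features(
    ["orders_last_month", "viewed_butter", "noise_a", "noise_b"],
    fake_fit_pvalues, batch_size=2, rounds=5,
)
```

Because previously selected features are re-evaluated alongside each new batch, a feature that loses significance once a better one arrives is dropped in a later round.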

Production Model
- Spark ML makes parameter tuning easy
- Reusable modules!

brett.bevers@dollarshaveclub.com
http://app.jobvite.com/m?33KSgiwI
