SlideShare una empresa de Scribd logo
1 de 50
Joining The Club
A c c e l e r a t i n g B i g D a t a w i t h A p a c h e S p a r k
Dollar Shave Club
Background on DSC
Engineering at DSC
Growth of Data Team
Show & Tell: Machine Learning Pipeline
Outline
A David and Goliath Story
Introduction
of new
members
Goliath
Engineering at DSC
u  Frontend
u  Ember.js web apps
u  iOS and Android apps
u  HTML email
u  Backend
u  Ruby on Rails web backends
u  Internal services (Ruby, Node.js, Golang, Python, Elixir)
u  Data and analytics (Python, SQL, Spark)
u  QA
u  CircleCI, SauceLabs, Jenkins
u  TestUnit, Selenium
u  IT
u  Office and warehouse IT
Engineering at DSC
highscalability.com
Data Engineering at DSC
A David and Big Data Story
Big Data
What is the barrier to entry?
Big Data
What is the barrier to entry?
u  Requires a different set of capabilities
Big Data
What is the barrier to entry?
u  Requires a different set of capabilities
u  Investing resources without an obvious ROI
Big Data
What is the barrier to entry?
u  Requires a different set of capabilities
u  Investing resources without an obvious ROI
Knowing where to start
Good Foundations
Data Engineering
u  Machine learning pipeline
u  Models served in production
u  Exploratory Analysis
u  Customer segmentation (clustering)
u  Hypothesis testing
u  Data mining
u  NLP (topic modeling)
Data Engineering
u  Maxwell + Kafka + Spark Streaming
u  Streaming data replication
u  Streaming metrics directly from the data layer
Anatomy of a Machine Learning Pipeline
Box Manager Email
Box Manager Email
Problem: Order the product tiles in “Box Manager Email” to maximize profit
Constraints:
u  Every customer sees some ordered set of products
u  Do not show products already added to box
Box Manager Email
Problem: Order the product tiles in “Box Manager Email” to maximize profit
Constraints:
u  Every customer sees some ordered set of products
u  Do not show products already added to box
+25% revenue per email open
Strategy
For each product, model the behavior which best distinguishes someone who
buys that product from someone who buys other products; rank a product by
the strength of the indicative behavior, when present, and rank a product
randomly otherwise
	
Model
u  Logistic Regression
u  Learns the “tipping point” between
success and failure
u  Success = “buys product X”
Design
u  Extract data from data warehouse (Redshift)
u  Join that data with hand-curated metadata (knowledge base)
u  Aggregate and pivot events by customer and discretized time
u  Generate a training set of feature vectors
u  Select features to include in the final model
u  Train and productionize the final model
def performExtraction(
extractorClass, exportName, join_table=None, join_key_col=None,
start_col=None, include_start_col=True, event_start_date=None
):
customer_id_col = extractorClass.customer_id_col
timestamp_col = extractorClass.timestamp_col
extr_agrs = extractorArgs(
customer_id_col, timestamp_col, join_table, join_key_col,
start_col, include_start_col, event_start_date
)
extractor = extractorClass(**extr_agrs)
export_path = redshiftExportPath(exportName)
return extractor.exportFromRedshift(export_path) # writes to Parquet
Extract
def exportFromRedshift(self, path):
export = self.exportDataFrame()
writeParquetWithRetry(export, path)
return sqlContext.read.parquet(path)
.persist(StorageLevel.MEMORY_AND_DISK)
def exportDataFrame(self):
query = self.generateQuery()
return sqlContext.read
.format("com.databricks.spark.redshift")
.option("url", urlOption)
.option("query", query)
.option("tempdir", tempdir)
.load()
Extract
Domain Knowledge is Critical
The way that an expert organizes and represents facts in their domain.
u  Guides feature extraction
u  Prevents overfitting
u  Vastly superior to unsupervised feature extraction (e.g., PCA)
Aggregate (Shard, Compress, Join) and Pivot!
This dance is hard to choreograph
Aggregate (Shard, Compress, Join) and Pivot!
This dance is hard to choreograph
u  8,736 columns
u  2.6 million rows
Dataframes API is not optimized for extremely wide datasets
def generateQuery(self):
return """
{0}
FROM {1}
GROUP BY customer_id, {2}, {3}, {4}
""".format(
self.selectClause(), self._tempTableName,
self.bucketingExpr(), self.timestampCol, self.startDateExpr
)
def perform(self):
self.preprocessedDataFrame().registerTempTable(self._tempTableName)
return sqlContext.sql(self.generateQuery())
Aggregate (Shard, Compress, Join) and Pivot!
Aggregate (Shard, Compress, Join) and Pivot!
Aggregate (Shard, Compress, Join) and Pivot!
(0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0)	
(	18,	(6,16),	(1,2)	)
def perform(self):
keyedMonthlyEvents = self.dataFrame.map(self.keyRow())
pivotRDD = keyedMonthlyEvents
.combineByKey(
self.initPivot(),
self.pivotEvent(),
self.combineDicts()
)
.map(self.convertToRow())
.persist(StorageLevel.MEMORY_AND_DISK)
return sqlContext.createDataFrame(pivotRDD, self.pivotedSchema())
Aggregate (Shard, Compress, Join) and Pivot!
Aggregate (Compress, Shard, Join) and Pivot!
Featurize
u  "Explode" each customer's history into several "windows" of time.
Featurize
u  "Explode" each customer's history into several "windows" of time.
u  Define one or more prediction targets
Featurize
u  "Explode" each customer's history into several "windows" of time.
u  Define one or more prediction targets
u  Standardize each historical feature
Featurize
u  "Explode" each customer's history into several "windows" of time.
u  Define one or more prediction targets
u  Standardize each historical feature
u  Persist on S3 as text files of compressed sparse vectors
Select Features
Select Features
1.  Randomly select a set of new features to test
Select Features
1.  Randomly select a set of new features to test
2.  Derive training set for new features + previously selected features
Select Features
1.  Randomly select a set of new features to test
2.  Derive training set for new features + previously selected features
3.  Train model
Select Features
1.  Randomly select a set of new features to test
2.  Derive training set for new features + previously selected features
3.  Train model
4.  Calculate the p-value for each feature
Select Features
1.  Randomly select a set of new features to test
2.  Derive training set for new features + previously selected features
3.  Train model
4.  Calculate the p-value for each feature
Select Features
1.  Randomly select a set of new features to test
2.  Derive training set for new features + previously selected features
3.  Train model
4.  Calculate the p-value for each feature
5.  Retain significant features
6.  Repeat
Production Model
u  Spark ML makes parameter tuning easy
u  Reusable modules!
brett.bevers@dollarshaveclub.com
h+p://app.jobvite.com/m?33KSgiwI

Más contenido relacionado

La actualidad más candente

GeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony FoxGeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony FoxDatabricks
 
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David SzakallasDatabricks
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Spark Summit
 
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellAn Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellDatabricks
 
Advanced goldengate training ⅰ
Advanced goldengate training ⅰAdvanced goldengate training ⅰ
Advanced goldengate training ⅰoggers
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Building Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling WaterBuilding Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling WaterSri Ambati
 
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series LibraryFrustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series LibraryIlya Ganelin
 
OrientDB vs Neo4j - Comparison of query/speed/functionality
OrientDB vs Neo4j - Comparison of query/speed/functionalityOrientDB vs Neo4j - Comparison of query/speed/functionality
OrientDB vs Neo4j - Comparison of query/speed/functionalityCurtis Mosters
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and DatasetKazuaki Ishizaki
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Lars Albertsson
 
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...Herman Wu
 
Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Databricks
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Bryan Yang
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengDatabricks
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiGrowth Intelligence
 
Introduction to Spark SQL & Catalyst
Introduction to Spark SQL & CatalystIntroduction to Spark SQL & Catalyst
Introduction to Spark SQL & CatalystTakuya UESHIN
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 
Spark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSpark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSigmoid
 

La actualidad más candente (20)

GeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony FoxGeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony Fox
 
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
 
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellAn Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
 
Advanced goldengate training ⅰ
Advanced goldengate training ⅰAdvanced goldengate training ⅰ
Advanced goldengate training ⅰ
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Building Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling WaterBuilding Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling Water
 
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series LibraryFrustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
 
OrientDB vs Neo4j - Comparison of query/speed/functionality
OrientDB vs Neo4j - Comparison of query/speed/functionalityOrientDB vs Neo4j - Comparison of query/speed/functionality
OrientDB vs Neo4j - Comparison of query/speed/functionality
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
 
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
 
Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...
 
Spark etl
Spark etlSpark etl
Spark etl
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with Luigi
 
Introduction to Spark SQL & Catalyst
Introduction to Spark SQL & CatalystIntroduction to Spark SQL & Catalyst
Introduction to Spark SQL & Catalyst
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Spark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSpark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. Jyotiska
 

Destacado

Destacado (8)

Data-Driven @ Netflix
Data-Driven @ NetflixData-Driven @ Netflix
Data-Driven @ Netflix
 
Case Study Netflix
Case Study NetflixCase Study Netflix
Case Study Netflix
 
Netflix Case Study
Netflix Case StudyNetflix Case Study
Netflix Case Study
 
Netflix marketing plan
Netflix marketing plan Netflix marketing plan
Netflix marketing plan
 
Netflix case study
Netflix case studyNetflix case study
Netflix case study
 
Netflix Business Model & Strategy
Netflix Business Model & StrategyNetflix Business Model & Strategy
Netflix Business Model & Strategy
 
Netflix Case Study
Netflix Case StudyNetflix Case Study
Netflix Case Study
 
Culture
CultureCulture
Culture
 

Similar a Joining the Club: Using Spark to Accelerate Big Data at Dollar Shave Club

Agile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDBAgile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDBRussell Jurney
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Agile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science MeetupAgile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science MeetupRussell Jurney
 
How to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on SnowflakeHow to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on SnowflakeAtScale
 
10 ways to make your code rock
10 ways to make your code rock10 ways to make your code rock
10 ways to make your code rockmartincronje
 
Social media analytics using Azure Technologies
Social media analytics using Azure TechnologiesSocial media analytics using Azure Technologies
Social media analytics using Azure TechnologiesKoray Kocabas
 
Dsug 05 02-15 - ScalDI - lightweight DI in Scala
Dsug 05 02-15 - ScalDI - lightweight DI in ScalaDsug 05 02-15 - ScalDI - lightweight DI in Scala
Dsug 05 02-15 - ScalDI - lightweight DI in ScalaUgo Matrangolo
 
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerLogical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerDataWorks Summit
 
Simplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseSimplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseFeatureByte
 
Big Data for Small Businesses & Startups
Big Data for Small Businesses & StartupsBig Data for Small Businesses & Startups
Big Data for Small Businesses & StartupsFujio Turner
 
Tips for Building your First XPages Java Application
Tips for Building your First XPages Java ApplicationTips for Building your First XPages Java Application
Tips for Building your First XPages Java ApplicationTeamstudio
 
Windows Store app using XAML and C#: Enterprise Product Development
Windows Store app using XAML and C#: Enterprise Product Development Windows Store app using XAML and C#: Enterprise Product Development
Windows Store app using XAML and C#: Enterprise Product Development Mahmoud Hamed Mahmoud
 
Thu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayThu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayAjay Shriwastava
 

Similar a Joining the Club: Using Spark to Accelerate Big Data at Dollar Shave Club (20)

Agile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDBAgile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDB
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Agile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science MeetupAgile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science Meetup
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
How to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on SnowflakeHow to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on Snowflake
 
10 ways to make your code rock
10 ways to make your code rock10 ways to make your code rock
10 ways to make your code rock
 
Coding Naked 2023
Coding Naked 2023Coding Naked 2023
Coding Naked 2023
 
Social media analytics using Azure Technologies
Social media analytics using Azure TechnologiesSocial media analytics using Azure Technologies
Social media analytics using Azure Technologies
 
Dsug 05 02-15 - ScalDI - lightweight DI in Scala
Dsug 05 02-15 - ScalDI - lightweight DI in ScalaDsug 05 02-15 - ScalDI - lightweight DI in Scala
Dsug 05 02-15 - ScalDI - lightweight DI in Scala
 
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerLogical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
 
Simplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseSimplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data Warehouse
 
Big Data for Small Businesses & Startups
Big Data for Small Businesses & StartupsBig Data for Small Businesses & Startups
Big Data for Small Businesses & Startups
 
Tips for Building your First XPages Java Application
Tips for Building your First XPages Java ApplicationTips for Building your First XPages Java Application
Tips for Building your First XPages Java Application
 
Windows Store app using XAML and C#: Enterprise Product Development
Windows Store app using XAML and C#: Enterprise Product Development Windows Store app using XAML and C#: Enterprise Product Development
Windows Store app using XAML and C#: Enterprise Product Development
 
Mstr meetup
Mstr meetupMstr meetup
Mstr meetup
 
Thu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayThu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjay
 
Ajax-Tutorial
Ajax-TutorialAjax-Tutorial
Ajax-Tutorial
 

Más de Data Con LA

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA
 

Más de Data Con LA (20)

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup Showcase
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendations
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI Ethics
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learning
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentation
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWS
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data Science
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with Kafka
 

Último

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 

Último (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

Joining the Club: Using Spark to Accelerate Big Data at Dollar Shave Club

  • 1. Joining The Club A c c e l e r a t i n g B i g D a t a w i t h A p a c h e S p a r k Dollar Shave Club
  • 2. Background on DSC Engineering at DSC Growth of Data Team Show & Tell: Machine Learning Pipeline Outline
  • 3. A David and Goliath Story Introduction
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10. Engineering at DSC u  Frontend u  Ember.js web apps u  iOS and Android apps u  HTML email u  Backend u  Ruby on Rails web backends u  Internal services (Ruby, Node.js, Golang, Python, Elixir) u  Data and analytics (Python, SQL, Spark) u  QA u  CircleCI, SauceLabs, Jenkins u  TestUnit, Selenium u  IT u  Office and warehouse IT
  • 12. Data Engineering at DSC A David and Big Data Story
  • 13. Big Data What is the barrier to entry?
  • 14. Big Data What is the barrier to entry? u  Requires a different set of capabilities
  • 15. Big Data What is the barrier to entry? u  Requires a different set of capabilities u  Investing resources without an obvious ROI
  • 16. Big Data What is the barrier to entry? u  Requires a different set of capabilities u  Investing resources without an obvious ROI Knowing where to start
  • 18. Data Engineering u  Machine learning pipeline u  Models served in production u  Exploratory Analysis u  Customer segmentation (clustering) u  Hypothesis testing u  Data mining u  NLP (topic modeling)
  • 19. Data Engineering u  Maxwell + Kafka + Spark Streaming u  Streaming data replication u  Streaming metrics directly from the data layer
  • 20. Anatomy of a Machine Learning Pipeline
  • 22. Box Manager Email Problem: Order the product tiles in “Box Manager Email” to maximize profit Constraints: u  Every customer sees some ordered set of products u  Do not show products already added to box
  • 23. Box Manager Email Problem: Order the product tiles in “Box Manager Email” to maximize profit Constraints: u  Every customer sees some ordered set of products u  Do not show products already added to box +25% revenue per email open
  • 24. Strategy For each product, model the behavior which best distinguishes someone who buys that product from someone who buys other products; rank a product by the strength of the indicative behavior, when present, and rank a product randomly otherwise Model u  Logistic Regression u  Learns the “tipping point” between success and failure u  Success = “buys product X”
  • 25. Design u  Extract data from data warehouse (Redshift) u  Join that data with hand-curated metadata (knowledge base) u  Aggregate and pivot events by customer and discretized time u  Generate a training set of feature vectors u  Select features to include in the final model u  Train and productionize the final model
  • 26. def performExtraction( extractorClass, exportName, join_table=None, join_key_col=None, start_col=None, include_start_col=True, event_start_date=None ): customer_id_col = extractorClass.customer_id_col timestamp_col = extractorClass.timestamp_col extr_agrs = extractorArgs( customer_id_col, timestamp_col, join_table, join_key_col, start_col, include_start_col, event_start_date ) extractor = extractorClass(**extr_agrs) export_path = redshiftExportPath(exportName) return extractor.exportFromRedshift(export_path) # writes to Parquet Extract
  • 27. def exportFromRedshift(self, path): export = self.exportDataFrame() writeParquetWithRetry(export, path) return sqlContext.read.parquet(path) .persist(StorageLevel.MEMORY_AND_DISK) def exportDataFrame(self): query = self.generateQuery() return sqlContext.read .format("com.databricks.spark.redshift") .option("url", urlOption) .option("query", query) .option("tempdir", tempdir) .load() Extract
  • 28.
  • 29. Domain Knowledge is Critical The way that an expert organizes and represents facts in their domain. u  Guides feature extraction u  Prevents overfitting u  Vastly superior to unsupervised feature extraction (e.g., PCA)
  • 30.
  • 31. Aggregate (Shard, Compress, Join) and Pivot! This dance is hard to choreograph
  • 32. Aggregate (Shard, Compress, Join) and Pivot! This dance is hard to choreograph u  8,736 columns u  2.6 million rows Dataframes API is not optimized for extremely wide datasets
  • 33. def generateQuery(self): return """ {0} FROM {1} GROUP BY customer_id, {2}, {3}, {4} """.format( self.selectClause(), self._tempTableName, self.bucketingExpr(), self.timestampCol, self.startDateExpr ) def perform(self): self.preprocessedDataFrame().registerTempTable(self._tempTableName) return sqlContext.sql(self.generateQuery()) Aggregate (Shard, Compress, Join) and Pivot!
  • 34. Aggregate (Shard, Compress, Join) and Pivot!
  • 35. Aggregate (Shard, Compress, Join) and Pivot! (0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0) ( 18, (6,16), (1,2) )
  • 36. def perform(self): keyedMonthlyEvents = self.dataFrame.map(self.keyRow()) pivotRDD = keyedMonthlyEvents .combineByKey( self.initPivot(), self.pivotEvent(), self.combineDicts() ) .map(self.convertToRow()) .persist(StorageLevel.MEMORY_AND_DISK) return sqlContext.createDataFrame(pivotRDD, self.pivotedSchema()) Aggregate (Shard, Compress, Join) and Pivot!
  • 37. Aggregate (Compress, Shard, Join) and Pivot!
  • 38. Featurize u  "Explode" each customer's history into several "windows" of time.
  • 39. Featurize u  "Explode" each customer's history into several "windows" of time. u  Define one or more prediction targets
  • 40. Featurize u  "Explode" each customer's history into several "windows" of time. u  Define one or more prediction targets u  Standardize each historical feature
  • 41. Featurize u  "Explode" each customer's history into several "windows" of time. u  Define one or more prediction targets u  Standardize each historical feature u  Persist on S3 as text files of compressed sparse vectors
  • 43. Select Features 1.  Randomly select a set of new features to test
  • 44. Select Features 1.  Randomly select a set of new features to test 2.  Derive training set for new features + previously selected features
  • 45. Select Features 1.  Randomly select a set of new features to test 2.  Derive training set for new features + previously selected features 3.  Train model
  • 46. Select Features 1.  Randomly select a set of new features to test 2.  Derive training set for new features + previously selected features 3.  Train model 4.  Calculate the p-value for each feature
  • 47. Select Features 1.  Randomly select a set of new features to test 2.  Derive training set for new features + previously selected features 3.  Train model 4.  Calculate the p-value for each feature
  • 48. Select Features 1.  Randomly select a set of new features to test 2.  Derive training set for new features + previously selected features 3.  Train model 4.  Calculate the p-value for each feature 5.  Retain significant features 6.  Repeat
  • 49. Production Model u  Spark ML makes parameter tuning easy u  Reusable modules!