Building, managing, and maintaining thousands of features across thousands of models is repetitive, tedious, and extremely challenging to scale. We will explore the ‘Feature Factory’ built at Databricks and implemented at several clients, along with the processes that are imperative for democratizing feature development and deployment. The Feature Factory lets consumers create features repeatably, simplifies scoring, and enables massive scalability through feature multiplication.
6. Measure vs Metric vs Feature
#UnifiedDataAnalytics #SparkAISummit
Measure: Numbers or values that can be summed and/or averaged, such as sales, leads, distances, durations, temperatures, and weight.
Metric: A quantifiable measure that is used to track and assess the status of a specific process.
Feature: An individual measurable property or characteristic of an observation. A raw, aggregated or altered metric that can provide predictive power in pattern recognition, classification, and regression.
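A minimal sketch of how the three terms relate, using made-up numbers. The sale amounts and the scaling constant `ASSUMED_MAX_TOTAL` are hypothetical and only serve to illustrate the distinction; they do not come from the slides.

```python
from statistics import mean

# Measures: hypothetical individual sale amounts, values that can be summed or averaged.
sales = [120.0, 80.0, 100.0]

# Metrics: quantifiable aggregates of the measures, used to track a process.
total_sales = sum(sales)
avg_sale = mean(sales)

# Feature: a raw, aggregated, or altered metric handed to a model.
# Here the total is scaled by an assumed normalizing constant.
ASSUMED_MAX_TOTAL = 1000.0
total_sales_scaled = total_sales / ASSUMED_MAX_TOTAL

print(total_sales, avg_sale, total_sales_scaled)
```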
9. Measure vs Metric vs Feature
Measure: 31
Feature: .002428571
Metric: +31 Country Code
10. How It Goes
Identify Metrics & Features
Write code that writes code
Join, Union, Agg
Optimize
Identify data scope and scale
Understand target if applicable
Scope down to relevant data
Scope up to include more data
Explore available data
Understand data models
Understand business rules
Modeling Data Filtering
Twisting (Sales X Time Ranges)
Tweaking (Scaling/Binning)
Clustering/PCA/Correlation
Pearson/Outlier
Model Stacking
Data Leaks
Model Tuning
Evaluation
Data Scientist
Data Engineer
11. Feature Factory
Identify Metrics & Features
Write code that writes code
Join, Union, Agg
Optimize
13. Why A Feature Factory
Rapidly prototype and deliver 1000s of features
Build them all and let science decide
Univariate Selection Algorithms
Feature Importance Models (XGBoost)
Correlation Matrices
High-Dimensional PCA
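As a rough illustration of "build them all and let science decide", the sketch below ranks candidate features by absolute Pearson correlation with a target and keeps the top n. This is one simple univariate selection approach, not the Feature Factory's own selection code; the feature names, values, and the `top_n_features` helper are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)

def top_n_features(features, target, n):
    """Rank features by |correlation| with the target and return the n best names."""
    scores = {name: abs(pearson(values, target)) for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical candidate features, one value per observation.
features = {
    "total_sales_3m": [1.0, 2.0, 3.0, 4.0],
    "total_customers_3m": [2.0, 1.0, 4.0, 3.0],
    "constant": [5.0, 5.0, 5.0, 5.0],
}
target = [10.0, 20.0, 30.0, 40.0]

selected = top_n_features(features, target, 2)
print(selected)
```

In practice the same ranking step could be swapped for a feature importance model or a correlation matrix, as listed on the slide.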
14. Why A Feature Factory
Feature reusability
Consistent logic (joins and formulas)
Optimized feature generation
Process
Documentation – Finally!
Scalable (10K+ features)
15. What Is A Feature Factory
Code Base - APIs
Accelerator – Configurable – Not OEM
Extensible & Customizable…Incomplete
16. How It Works
Land the scaffolding
Gut the demo
Structure, Configure your Concepts
Initialize your data and your metrics
24. Highlights - Multipliers
Feature Families
– Sales
– Customer
– Weather
– Geo
Multipliers
– Time
– Categorical
– Trends
Base Metrics (Sales/Customer)
Categorical (Category)
Time Window (Multiplier)
Base Metric (Weather/Geo)
Total_Sales_6m_Sunny_Category-MensShoes
Total_Customers_3m_GeoRange_CheckoutMethod-Self
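The naming pattern above can be sketched as a cross product of base metrics, time windows, and categorical values. This is a simplified illustration, not the Feature Factory API: the `multiply` helper and the sample values are hypothetical, and the weather/geo component of the names is left out to keep the example short.

```python
from itertools import product

# Hypothetical inputs: two base metrics, two time windows, one categorical dimension.
base_metrics = ["Total_Sales", "Total_Customers"]
time_windows = ["3m", "6m"]
categories = {"Category": ["MensShoes", "WomensShoes"]}

def multiply(base_metrics, time_windows, categories):
    """Return one feature name per metric x time window x categorical value."""
    names = []
    for metric, window in product(base_metrics, time_windows):
        for dimension, values in categories.items():
            for value in values:
                names.append(f"{metric}_{window}_{dimension}-{value}")
    return names

feature_names = multiply(base_metrics, time_windows, categories)
for name in feature_names:
    print(name)
```

Adding one more window or category value grows the list multiplicatively, which is where the feature counts on the next slide come from.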
25. Highlights - Multipliers
Sales Metrics (8)
Customer Metrics (8)
Time Windows: 1m, 3m, 6m, 12m, 1w, 2w, 3w, 4w
Categorical Dims: Item Category (8), Demographics (12)
8 * 9 * 8 * 8 * 12 = 55,296 possible features in fewer than 20 lines of code
Common Example
8 sales metrics * 4 time windows * 5 dims with an average of 12 distinct values each
8 * 4 * 5 * 12 = 1,920 features
Send them to a feature importance/selection process and pick the top n
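The arithmetic on this slide can be checked directly. The factors below are copied from the slide as given; the variable names are only labels for this sketch.

```python
from math import prod

# Factors of the full example, taken verbatim from the slide.
full_example_factors = [8, 9, 8, 8, 12]
full_example_features = prod(full_example_factors)

# Common example: 8 sales metrics, 4 time windows,
# 5 categorical dims with an average of 12 distinct values each.
sales_metrics, time_windows, dims, avg_distinct_values = 8, 4, 5, 12
common_example_features = sales_metrics * time_windows * dims * avg_distinct_values

print(full_example_features)    # 55296
print(common_example_features)  # 1920
```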
27. Highlights – Canned Data
Where is the relevant data?
Just browse the data related to the concept
31. Highlights – Date/Time Manager
Unified Time Definition
Define it once and be done with it
Simplified Filtering
Time-Based Splits (ML)
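One way to picture a unified time definition is a small object that owns the reference date and derives every window and split from it. The `TimeManager` class below is a hypothetical sketch under that assumption, not the Feature Factory's date/time manager; the rows and dates are invented.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)
class TimeManager:
    """Holds the single reference date that all windows and splits are derived from."""
    reference_date: date

    def window_start(self, days):
        return self.reference_date - timedelta(days=days)

    def in_window(self, day, days):
        """True if `day` falls in the trailing window (start exclusive, reference inclusive)."""
        return self.window_start(days) < day <= self.reference_date

    def split(self, rows, holdout_days, key="day"):
        """Time-based split: rows in the last `holdout_days` go to test, the rest to train."""
        cutoff = self.window_start(holdout_days)
        train = [row for row in rows if row[key] <= cutoff]
        test = [row for row in rows if row[key] > cutoff]
        return train, test

# Hypothetical daily sales rows.
rows = [
    {"day": date(2019, 1, 15), "sales": 10},
    {"day": date(2019, 2, 20), "sales": 20},
    {"day": date(2019, 3, 25), "sales": 30},
]

tm = TimeManager(reference_date=date(2019, 3, 31))
last_30_days = [row for row in rows if tm.in_window(row["day"], 30)]
train, test = tm.split(rows, holdout_days=30)
print(len(last_30_days), len(train), len(test))
```

Because filtering and splitting both go through the same reference date, changing it in one place moves every window consistently.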
33. Highlights – Easy Docs
Add the docs to the Metrics
Add the docs to the multipliers
Features are now self-documenting
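The idea can be sketched by attaching a doc string to each metric and each multiplier and composing both the feature name and its description from those parts. The `Metric`, `Multiplier`, and `Feature` classes here are hypothetical illustrations, not the Feature Factory's own types.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str
    doc: str

@dataclass(frozen=True)
class Multiplier:
    suffix: str
    doc: str

@dataclass(frozen=True)
class Feature:
    metric: Metric
    multipliers: tuple

    @property
    def name(self):
        # Feature name is the metric name followed by each multiplier suffix.
        return "_".join([self.metric.name, *(m.suffix for m in self.multipliers)])

    @property
    def doc(self):
        # Feature documentation is assembled from the docs of its parts.
        return " ".join([self.metric.doc, *(m.doc for m in self.multipliers)])

total_sales = Metric("Total_Sales", "Sum of sales amounts")
six_months = Multiplier("6m", "over the trailing 6 months")
mens_shoes = Multiplier("Category-MensShoes", "for item category MensShoes")

feature = Feature(total_sales, (six_months, mens_shoes))
print(feature.name)
print(feature.doc)
```

Since the docs live on the building blocks, every generated feature carries a description without any extra per-feature writing.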