SlideShare una empresa de Scribd logo
1 de 28
Descargar para leer sin conexión
Scaling ML Feature Engineering with
Apache Spark at Facebook
Cheng Su & Sameer Agarwal
Facebook Inc.
About Us
▪ Sameer Agarwal
▪ Software Engineer at Facebook (Data Platform Team)
▪ Apache Spark Committer (Spark Core/SQL)
▪ Previously at Databricks and UC Berkeley
▪ Cheng Su
▪ Software Engineer at Facebook (Data Platform Team)
▪ Apache Spark Contributor (Spark SQL)
▪ Previously worked on Hive & Hadoop at Facebook
Agenda
▪ Machine Learning at Facebook
▪ Data Layouts (Tables and Physical Encodings)
▪ Feature Reaping
▪ Feature Injection
▪ Future Work
Machine Learning at Facebook1
Data Features Training Inference
1Hazelwood et al., Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In HPCA 2018
PredictionsModel
Machine Learning at Facebook1
Data Features Training
Inferenc
e
PredictionsModel
This Talk
1. Data Layouts (Tables and Physical Encodings)
2. Feature Reaping
3. Feature Injection
1Hazelwood et al., Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In HPCA 2018
Agenda
▪ Machine Learning at Facebook
▪ Data Layouts (Tables and Physical Encodings)
▪ Feature Reaping
▪ Feature Injection
▪ Future Work
Data Layouts (Tables and Physical Encodings)
Training Data Table
- Table to store data for ML training
- Huge volume (multiple PBs/day)
userId: BIGINT
adId: BIGINT
features: MAP<INT, DOUBLE>
…
Feature Tables
- Tables to store all possible features (many of them aren’t promoted in training data
table)
- Smaller volume (low-100s of TBs/ day)
userId: BIGINT
features: MAP<INT, DOUBLE>
…
gender likes …
age
state
country
Data Layouts (Tables and Physical Encodings)
1. Feature Injection: Extending base features with new/experimental features to
improve model performance. Think “adding new keys to a map”
gender likes …
age
state
country
Feature Injection
Training Data Table Feature Tables
Data Layouts (Tables and Physical Encodings)
1. Feature Injection: Extending base features with new/experimental features to
improve model performance. Think “adding new keys to a map”
2. Feature Reaping: Removing unnecessary features (id and value) from training
data. Think “deleting existing keys from a map”
gender likes …
age
state
country
Feature Injection
Feature Reaping
Training Data Table Feature Tables
Background: Apache ORC
▪ Stripe (Row Group)
▪ Rows are divided into multiple groups
▪ Stream
▪ Columns are stored separately
▪ PRESET, DATA, LENGTH stream for each column
▪ Different encoding and compression
strategy for each column
How is a Feature Map Stored in ORC?
▪ Key and value are stored as separate streams/columns
- Raw Data
- Row 1: (k1, v1)
- Row 2: (k1, v2), (k2, v3)
- Row 3: (k1, v5), (k2, v4)
- Streams
- Key stream: k1, k1, k2, k1, k2
- Value stream: v1, v2, v3 v5, v4
▪ Each stream is individually encoded and compressed
▪ Reading or deleting specific keys (i.e., feature reaping) becomes a
problem
- Need to read (decompress and decode) and re-write ALL keys and values
features: MAP<INT, DOUBLE>
STRUCT
col -1, node: 0
MAP
INT
col 0, node: 2
DOUBLE
col 0, node: 3
col 0, node: 1
k1, k1, k2, k1, k2 v1, v2, v3 v5, v4
Introducing: ORC Flattened Map
▪ Values that correspond to each key are stored as separate streams
- Raw Data
- Row 1: (k1, v1)
- Row 2: (k1, v2), (k2, v3)
- Row 3: (k1, v5), (k2, v4)
- Streams
- k1 stream: v1, v2, v5
- k2 stream: NULL, v3, v4
- Stores map like a struct
▪ Each key’s value stream is individually encoded and compressed
▪ Reading or deleting specific keys becomes very efficient!
features: MAP<INT, DOUBLE>
STRUCT
col -1, node: 0
MAP
Value (k1)
col 0, node: 3, seq: 1
Value (k2)
col 0, node: 1
v1, v2, v5 NULL, v3, v4
col 0, node: 3, seq: 2
Agenda
▪ Machine Learning at Facebook
▪ Data Layouts (Tables and Physical Encodings)
▪ Feature Reaping
▪ Feature Injection
▪ Future Work
Feature Reaping
▪ Feature Reaping frameworks generate Spark
SQL queries based on table name, partitions,
and reaped feature ids
▪ For each reaping SQL query, Spark has special
customization in query planner, execution
engine and commit protocol
▪ Each Spark task launches a SQL transform
process, and uses native/C++ binary to do
efficient flat map operations
SparkJavaExecutor
c++ reaper
transform
SparkJavaExecutor
c++ reaper
transform
training_data_v1_1.orc training_data_v1_2.orc
training_data_v2_1.orc training_data_v2_2.orc
Performance
0
10000
20000
30000
40000
50000
20PB
CPU(days)
CPU cost for flat map vs naïve solution*
(14x better on 20PB data)
Naïve Flat Map
0
500000
1000000
1500000
2000000
300PB
CPU(days)
CPU cost for flat map vs naïve solution*
(89x better on 300PB data)
Naïve Flat Map
▪ Case 1
▪ Input data size: 20PB
▪ # of reaped features: 200
▪ # total features: ~1k
▪ Case 2
▪ Input data size: 300PB
▪ # of reaped features: 200
▪ # total features: ~10k
*Naïve solution: A Spark SQL query to re-write all data
with removing required features from map column with
UDF/Lambda.
Agenda
▪ Machine Learning at Facebook
▪ Data Layouts (Tables and Physical Encodings)
▪ Feature Reaping
▪ Feature Injection
▪ Future Work
Data Layouts (Tables and Physical Encodings)
Feature Injection: Extending base features with new/experimental features to
improve model performance. Think “adding new keys to a map”
Requirements:
1. Allow fast ML training experimentation
2. Save storage space
gender likes …
age
state
country
Feature Injection
Training Data Table Feature Tables
Data Layouts (Tables and Physical Encodings)
Feature Injection: Extending base features with new/experimental features to
improve model performance. Think “adding new keys to a map”
Requirements:
1. Allow fast ML training experimentation
2. Save storage space
gender likes …
age
state
country
Feature Injection
Introducing: Aligned Tables!
Training Data Table Feature Tables
Introducing: Aligned Table
▪ Intuition: Store the output of the join between the training table
and the feature table in 2 separate row-by-row aligned tables
▪ An aligned table is a table that has the same layout as the original
table
- Same number of files
- Same file names
- Same number of rows (and their order) in each file.
col -1, node: 0
col 0, node: 3, seq: 1
id features
1 ...
2 ...
5 ...
id features
3 ...
4 ...
6 ...
id feature
1 f1
2 f2
4 f4
6 f6
training table
feature table
file_1.orc file_2.orc
file_1.orc
id feature
1 f1
2 f2
5 NULL
id feature
3 NULL
4 f4
6 f6
file_1.orc file_2.orc
aligned table
Query Plan for Aligned Table
col -1, node: 0
col 0, node: 3, seq: 1
id features
1 ...
2 ...
5 ...
id features
3 ...
4 ...
6 ...
id feature
1 f1
2 f2
4 f4
6 f6
training table
feature table
file_1.orc file_2.orc
file_1.orc
id feature
1 f1
2 f2
5 NULL
id feature
3 NULL
4 f4
6 f6
file_1.orc file_2.orc
aligned table
Scan
(training table)
Scan
(feature table)
Project
(…, file_name,
row_order)
Join
(LEFT OUTER)
Shuffle
(file_name)
Sort
(file_name,
row_order)
InsertIntoHadoopFsRelationComman
d (Aligned Table)
Reading Aligned Tables
▪ FB-ORC aligned table row-by-row merge reader
▪ Read each aligned table file with the corresponding original table file in one task
▪ Read row-by-row according to row order
▪ Merge aligned table columns per row with corresponding original table columns per row
id features
1 ...
2 ...
5 ...
id features
3 ...
4 ...
6 ...
training table
file_1.orc file_2.orc
id feature
1 f1
2 f2
5 NULL
id feature
3 NULL
4 f4
6 f6
file_1.orc file_2.orc
aligned table aligned tabletraining table
reader task 1 reader task 2
End to End Performance
1. Baseline 1: Left Outer Join
▪ LEFT OUTER join that materializes new columns/sub-fields into training table
▪ Cons: Reads and overwrites ALL columns of training table every time
End to End Performance
1. Baseline 1: Left Outer Join
▪ LEFT OUTER join that materializes new columns/sub-fields into training table
▪ Cons: Reads and overwrites ALL columns of training table every time
Aligned Tables vs Left Outer Join
Compute Savings: 15x
Storage Savings: 30x
End to End Performance
1. Baseline 1: Left Outer Join
▪ LEFT OUTER join that materializes new columns/sub-fields into training table
▪ Cons: Reads and overwrites ALL columns of training table every time
2. Baseline 2: Lookup Hash Join
▪ Load feature table(s) into a distributed hash table (Laser1)
▪ Lookup hash join while reading training table
▪ Cons:
▪ Adds an external dependency on a distributed hash table; impacts latency, reliability &
efficiency
▪ Needs a lookup hash join each time the training table is read
1
Laser: a distributed hash table service built on top of RocksDB, see https://research.fb.com/wp-
content/uploads/2016/11/realtime_data_processing_at_facebook.pdf for details
Aligned Tables vs Left Outer Join
Compute Savings: 15x
Storage Savings: 30x
End to End Performance
1. Baseline 1: Left Outer Join
▪ LEFT OUTER join that materializes new columns/sub-fields into training table
▪ Cons: Reads and overwrites ALL columns of training table every time
2. Baseline 2: Lookup Hash Join
▪ Load feature table(s) into a distributed hash table (Laser1)
▪ Lookup hash join while reading training table
▪ Cons:
▪ Adds an external dependency on a distributed hash table; impacts latency, reliability &
efficiency
▪ Needs a lookup hash join each time the training table is read
1
Laser: a distributed hash table service built on top of RocksDB, see https://research.fb.com/wp-
content/uploads/2016/11/realtime_data_processing_at_facebook.pdf for details
Aligned Tables vs Left Outer Join
Compute Savings: 15x
Storage Savings: 30x
Aligned Tables vs Lookup Hash
Join
Compute Savings: 1.5x
Storage Savings: 2.1x
Agenda
▪ Machine Learning at Facebook
▪ Data Layouts (Tables and Physical Encodings)
▪ Feature Reaping
▪ Feature Injection
▪ Future Work
Future Work
▪ Better Spark SQL interface for ML primitives (e.g., UPSERTs)
▪ Onboarding more ML use cases to Spark
▪ Batch Inference
▪ Training
MERGE training_table
PARTITION(ds='2020-10-28', pipeline='...', ts)
USING (
SELECT ...) AS f
ON features[0][0] = f.key
WHEN MATCHED THEN UPDATE
SET float_features = MAP_CONCAT(float_features,
f.densefeatures)
Thank you!
Your feedback is important to us.
Don’t forget to rate
and review the sessions.

Más contenido relacionado

La actualidad más candente

Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraFlink Forward
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in RustAndrew Lamb
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsMonitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsDatabricks
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
 
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Simplilearn
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Guido Schmutz
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
Migrating with Debezium
Migrating with DebeziumMigrating with Debezium
Migrating with DebeziumMike Fowler
 
Zipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkZipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkDatabricks
 
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...Andrew Lamb
 
The Feature Store in Hopsworks
The Feature Store in HopsworksThe Feature Store in Hopsworks
The Feature Store in HopsworksJim Dowling
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
An Insider’s Guide to Maximizing Spark SQL Performance
 An Insider’s Guide to Maximizing Spark SQL Performance An Insider’s Guide to Maximizing Spark SQL Performance
An Insider’s Guide to Maximizing Spark SQL PerformanceTakuya UESHIN
 

La actualidad más candente (20)

Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and PluginsMonitor Apache Spark 3 on Kubernetes using Metrics and Plugins
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
Migrating with Debezium
Migrating with DebeziumMigrating with Debezium
Migrating with Debezium
 
Zipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkZipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering Framework
 
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
 
The Feature Store in Hopsworks
The Feature Store in HopsworksThe Feature Store in Hopsworks
The Feature Store in Hopsworks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
An Insider’s Guide to Maximizing Spark SQL Performance
 An Insider’s Guide to Maximizing Spark SQL Performance An Insider’s Guide to Maximizing Spark SQL Performance
An Insider’s Guide to Maximizing Spark SQL Performance
 

Similar a Scaling Machine Learning Feature Engineering in Apache Spark at Facebook

Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta LakeDatabricks
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Databricks
 
Spark SQL Beyond Official Documentation
Spark SQL Beyond Official DocumentationSpark SQL Beyond Official Documentation
Spark SQL Beyond Official DocumentationDatabricks
 
Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1Databricks
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizationsGal Marder
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkDatabricks
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkDatabricks
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...Databricks
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Holden Karau
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hoodAdarsh Pannu
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSyed Hadoop
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit
 
10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQLSatoshi Nagayasu
 

Similar a Scaling Machine Learning Feature Engineering in Apache Spark at Facebook (20)

Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
 
Spark SQL Beyond Official Documentation
Spark SQL Beyond Official DocumentationSpark SQL Beyond Official Documentation
Spark SQL Beyond Official Documentation
 
Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1Deep Dive into the New Features of Apache Spark 3.1
Deep Dive into the New Features of Apache Spark 3.1
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
 
Meetup talk
Meetup talkMeetup talk
Meetup talk
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with Spark
 
Rds data lake @ Robinhood
Rds data lake @ Robinhood Rds data lake @ Robinhood
Rds data lake @ Robinhood
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.com
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 
10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL
 

Más de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Más de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 

Último (20)

Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 

Scaling Machine Learning Feature Engineering in Apache Spark at Facebook

  • 1. Scaling ML Feature Engineering with Apache Spark at Facebook Cheng Su & Sameer Agarwal Facebook Inc.
  • 2. About Us ▪ Sameer Agarwal ▪ Software Engineer at Facebook (Data Platform Team) ▪ Apache Spark Committer (Spark Core/SQL) ▪ Previously at Databricks and UC Berkeley ▪ Cheng Su ▪ Software Engineer at Facebook (Data Platform Team) ▪ Apache Spark Contributor (Spark SQL) ▪ Previously worked on Hive & Hadoop at Facebook
  • 3. Agenda ▪ Machine Learning at Facebook ▪ Data Layouts (Tables and Physical Encodings) ▪ Feature Reaping ▪ Feature Injection ▪ Future Work
  • 4. Machine Learning at Facebook1 Data Features Training Inference 1Hazelwood et al., Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In HPCA 2018 PredictionsModel
  • 5. Machine Learning at Facebook1 Data Features Training Inferenc e PredictionsModel This Talk 1. Data Layouts (Tables and Physical Encodings) 2. Feature Reaping 3. Feature Injection 1Hazelwood et al., Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In HPCA 2018
  • 6. Agenda ▪ Machine Learning at Facebook ▪ Data Layouts (Tables and Physical Encodings) ▪ Feature Reaping ▪ Feature Injection ▪ Future Work
  • 7. Data Layouts (Tables and Physical Encodings) Training Data Table - Table to store data for ML training - Huge volume (multiple PBs/day) userId: BIGINT adId: BIGINT features: MAP<INT, DOUBLE> … Feature Tables - Tables to store all possible features (many of them aren’t promoted in training data table) - Smaller volume (low-100s of TBs/ day) userId: BIGINT features: MAP<INT, DOUBLE> … gender likes … age state country
  • 8. Data Layouts (Tables and Physical Encodings) 1. Feature Injection: Extending base features with new/experimental features to improve model performance. Think “adding new keys to a map” gender likes … age state country Feature Injection Training Data Table Feature Tables
  • 9. Data Layouts (Tables and Physical Encodings) 1. Feature Injection: Extending base features with new/experimental features to improve model performance. Think “adding new keys to a map” 2. Feature Reaping: Removing unnecessary features (id and value) from training data. Think “deleting existing keys from a map” gender likes … age state country Feature Injection Feature Reaping Training Data Table Feature Tables
  • 10. Background: Apache ORC ▪ Stripe (Row Group) ▪ Rows are divided into multiple groups ▪ Stream ▪ Columns are stored separately ▪ PRESET, DATA, LENGTH stream for each column ▪ Different encoding and compression strategy for each column
  • 11. How is a Feature Map Stored in ORC? ▪ Key and value are stored as separate streams/columns - Raw Data - Row 1: (k1, v1) - Row 2: (k1, v2), (k2, v3) - Row 3: (k1, v5), (k2, v4) - Streams - Key stream: k1, k1, k2, k1, k2 - Value stream: v1, v2, v3 v5, v4 ▪ Each stream is individually encoded and compressed ▪ Reading or deleting specific keys (i.e., feature reaping) becomes a problem - Need to read (decompress and decode) and re-write ALL keys and values features: MAP<INT, DOUBLE> STRUCT col -1, node: 0 MAP INT col 0, node: 2 DOUBLE col 0, node: 3 col 0, node: 1 k1, k1, k2, k1, k2 v1, v2, v3 v5, v4
  • 12. Introducing: ORC Flattened Map ▪ Values that correspond to each key are stored as separate streams - Raw Data - Row 1: (k1, v1) - Row 2: (k1, v2), (k2, v3) - Row 3: (k1, v5), (k2, v4) - Streams - k1 stream: v1, v2, v5 - k2 stream: NULL, v3, v4 - Stores map like a struct ▪ Each key’s value stream is individually encoded and compressed ▪ Reading or deleting specific keys becomes very efficient! features: MAP<INT, DOUBLE> STRUCT col -1, node: 0 MAP Value (k1) col 0, node: 3, seq: 1 Value (k2) col 0, node: 1 v1, v2, v5 NULL, v3, v4 col 0, node: 3, seq: 2
  • 13. Agenda ▪ Machine Learning at Facebook ▪ Data Layouts (Tables and Physical Encodings) ▪ Feature Reaping ▪ Feature Injection ▪ Future Work
  • 14. Feature Reaping ▪ Feature Reaping frameworks generate Spark SQL queries based on table name, partitions, and reaped feature ids ▪ For each reaping SQL query, Spark has special customization in query planner, execution engine and commit protocol ▪ Each Spark task launches a SQL transform process, and uses native/C++ binary to do efficient flat map operations SparkJavaExecutor c++ reaper transform SparkJavaExecutor c++ reaper transform training_data_v1_1.orc training_data_v1_2.orc training_data_v2_1.orc training_data_v2_2.orc
  • 15. Performance 0 10000 20000 30000 40000 50000 20PB CPU(days) CPU cost for flat map vs naïve solution* (14x better on 20PB data) Naïve Flat Map 0 500000 1000000 1500000 2000000 300PB CPU(days) CPU cost for flat map vs naïve solution* (89x better on 300PB data) Naïve Flat Map ▪ Case 1 ▪ Input data size: 20PB ▪ # of reaped features: 200 ▪ # total features: ~1k ▪ Case 2 ▪ Input data size: 300PB ▪ # of reaped features: 200 ▪ # total features: ~10k *Naïve solution: A Spark SQL query to re-write all data with removing required features from map column with UDF/Lambda.
  • 16. Agenda ▪ Machine Learning at Facebook ▪ Data Layouts (Tables and Physical Encodings) ▪ Feature Reaping ▪ Feature Injection ▪ Future Work
  • 17. Data Layouts (Tables and Physical Encodings) Feature Injection: Extending base features with new/experimental features to improve model performance. Think “adding new keys to a map” Requirements: 1. Allow fast ML training experimentation 2. Save storage space gender likes … age state country Feature Injection Training Data Table Feature Tables
  • 18. Data Layouts (Tables and Physical Encodings) Feature Injection: Extending base features with new/experimental features to improve model performance. Think “adding new keys to a map” Requirements: 1. Allow fast ML training experimentation 2. Save storage space gender likes … age state country Feature Injection Introducing: Aligned Tables! Training Data Table Feature Tables
  • 19. Introducing: Aligned Table ▪ Intuition: Store the output of the join between the training table and the feature table in 2 separate row-by-row aligned tables ▪ An aligned table is a table that has the same layout as the original table - Same number of files - Same file names - Same number of rows (and their order) in each file. col -1, node: 0 col 0, node: 3, seq: 1 id features 1 ... 2 ... 5 ... id features 3 ... 4 ... 6 ... id feature 1 f1 2 f2 4 f4 6 f6 training table feature table file_1.orc file_2.orc file_1.orc id feature 1 f1 2 f2 5 NULL id feature 3 NULL 4 f4 6 f6 file_1.orc file_2.orc aligned table
  • 20. Query Plan for Aligned Table col -1, node: 0 col 0, node: 3, seq: 1 id features 1 ... 2 ... 5 ... id features 3 ... 4 ... 6 ... id feature 1 f1 2 f2 4 f4 6 f6 training table feature table file_1.orc file_2.orc file_1.orc id feature 1 f1 2 f2 5 NULL id feature 3 NULL 4 f4 6 f6 file_1.orc file_2.orc aligned table Scan (training table) Scan (feature table) Project (…, file_name, row_order) Join (LEFT OUTER) Shuffle (file_name) Sort (file_name, row_order) InsertIntoHadoopFsRelationComman d (Aligned Table)
  • 21. Reading Aligned Tables ▪ FB-ORC aligned table row-by-row merge reader ▪ Read each aligned table file with the corresponding original table file in one task ▪ Read row-by-row according to row order ▪ Merge aligned table columns per row with corresponding original table columns per row id features 1 ... 2 ... 5 ... id features 3 ... 4 ... 6 ... training table file_1.orc file_2.orc id feature 1 f1 2 f2 5 NULL id feature 3 NULL 4 f4 6 f6 file_1.orc file_2.orc aligned table aligned tabletraining table reader task 1 reader task 2
  • 22. End to End Performance 1. Baseline 1: Left Outer Join ▪ LEFT OUTER join that materializes new columns/sub-fields into training table ▪ Cons: Reads and overwrites ALL columns of training table every time
  • 23. End to End Performance 1. Baseline 1: Left Outer Join ▪ LEFT OUTER join that materializes new columns/sub-fields into training table ▪ Cons: Reads and overwrites ALL columns of training table every time Aligned Tables vs Left Outer Join Compute Savings: 15x Storage Savings: 30x
  • 24. End to End Performance 1. Baseline 1: Left Outer Join ▪ LEFT OUTER join that materializes new columns/sub-fields into training table ▪ Cons: Reads and overwrites ALL columns of training table every time 2. Baseline 2: Lookup Hash Join ▪ Load feature table(s) into a distributed hash table (Laser1) ▪ Lookup hash join while reading training table ▪ Cons: ▪ Adds an external dependency on a distributed hash table; impacts latency, reliability & efficiency ▪ Needs a lookup hash join each time the training table is read 1 Laser: a distributed hash table service built on top of RocksDB, see https://research.fb.com/wp- content/uploads/2016/11/realtime_data_processing_at_facebook.pdf for details Aligned Tables vs Left Outer Join Compute Savings: 15x Storage Savings: 30x
  • 25. End to End Performance 1. Baseline 1: Left Outer Join ▪ LEFT OUTER join that materializes new columns/sub-fields into training table ▪ Cons: Reads and overwrites ALL columns of training table every time 2. Baseline 2: Lookup Hash Join ▪ Load feature table(s) into a distributed hash table (Laser1) ▪ Lookup hash join while reading training table ▪ Cons: ▪ Adds an external dependency on a distributed hash table; impacts latency, reliability & efficiency ▪ Needs a lookup hash join each time the training table is read 1 Laser: a distributed hash table service built on top of RocksDB, see https://research.fb.com/wp- content/uploads/2016/11/realtime_data_processing_at_facebook.pdf for details Aligned Tables vs Left Outer Join Compute Savings: 15x Storage Savings: 30x Aligned Tables vs Lookup Hash Join Compute Savings: 1.5x Storage Savings: 2.1x
  • 26. Agenda ▪ Machine Learning at Facebook ▪ Data Layouts (Tables and Physical Encodings) ▪ Feature Reaping ▪ Feature Injection ▪ Future Work
  • 27. Future Work ▪ Better Spark SQL interface for ML primitives (e.g., UPSERTs) ▪ Onboarding more ML use cases to Spark ▪ Batch Inference ▪ Training MERGE training_table PARTITION(ds='2020-10-28', pipeline='...', ts) USING ( SELECT ...) AS f ON features[0][0] = f.key WHEN MATCHED THEN UPDATE SET float_features = MAP_CONCAT(float_features, f.densefeatures)
  • 28. Thank you! Your feedback is important to us. Don’t forget to rate and review the sessions.