Modularized ETL
Writing with Spark
Neelesh Salian
Software Engineer - Stitch Fix
May 27, 2021
whoami
Neelesh Salian
Software Engineer - Data Platform
Agenda
▪ What is Stitch Fix?
▪ Apache Spark @ Stitch Fix
▪ Spark Writer Modules
▪ Learnings & Future Work
What is Stitch Fix?
What does the company do?
Stitch Fix is a personalized styling service
Shop at your personal curated store. Check out what you like.
Data Science is behind everything we do
algorithms-tour.stitchfix.com
• Algorithms org
• 145+ Data Scientists and Platform engineers
• 3 main verticals + platform
Apache Spark @ Stitch Fix
How do we use Spark across our teams?
Spark @ Stitch Fix - History and Current State
How it started
▪ Spark was introduced to enhance and scale ETL capabilities (circa 2016)
▪ Starting version: 1.2.x
▪ Spark SQL was the dominant use case
▪ Used for reading and writing data into the warehouse as Hive tables
How it’s going
▪ Current version: 2.4.x; 3.1.x [prototyping]
▪ Used for all ETL reads and writes, production and test
▪ Spark serves regular PySpark, SQL, and Scala jobs, notebooks, and pandas-based readers/writers
▪ Controls all writing, with added functionality [this talk]
Spark @ Stitch Fix - Spark Tooling
• Spark SQL + PySpark + Scala
• Containerized Spark driver + AWS EMR (for compute)
• Used for production and staging ETL by Data Scientists
• Notebooks
• JupyterHub setup with Stitch Fix libraries and Python packages pre-installed
• Used by Data Scientists to test and prototype
• Pandas-based readers/writers
• Read and write data using pandas DataFrames
• No bootstrap time for Spark jobs - uses Apache Livy for execution
• Used for test + production
All the tooling available to Data Scientists to use Spark to read and write data
Spark @ Stitch Fix - Writing data to the warehouse
Spark @ Stitch Fix - Steps while writing data
At the start, and even today, writing data through the writer library has these steps:
1. Validation - check the DataFrame for type matches, schema matches against the Hive table, and overflow type checks (a minimal sketch of this step follows the list).
2. Writing the data into files in S3 - Parquet or text format, based on the Hive table’s configuration.
3. Updating the Hive Metastore - with the versioning scheme for the data.
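The validation step could look roughly like the sketch below: a schema comparison between the incoming DataFrame and the target Hive table. The helper name and error handling are illustrative assumptions, not the actual Stitch Fix writer library.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical validation helper: checks that every column in the incoming
// DataFrame exists in the target Hive table with a matching data type.
def validateAgainstHiveTable(spark: SparkSession,
                             df: DataFrame,
                             databaseName: String,
                             tableName: String): DataFrame = {
  val targetTypes = spark.table(s"$databaseName.$tableName")
    .schema
    .map(f => f.name -> f.dataType)
    .toMap

  df.schema.foreach { field =>
    targetTypes.get(field.name) match {
      case None =>
        throw new IllegalArgumentException(
          s"Column '${field.name}' is not present in $databaseName.$tableName")
      case Some(expected) if expected != field.dataType =>
        throw new IllegalArgumentException(
          s"Column '${field.name}' has type ${field.dataType}, expected $expected")
      case _ => // name and type match, nothing to do
    }
  }
  df
}
```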
Spark @ Stitch Fix - Data Versioning
• Writing into a partitioned table (e.g. partitioned by a date_column for a date value of 20210527):
• s3:<bucket>/<hive_db_name>/<hive_table_name>/date_column=20210527/batch_id=epoch_ts
• Writing into a non-partitioned table:
• s3:<bucket>/<hive_db_name>/<hive_table_name>/batch_id=epoch_ts
We also add the latest write_timestamp to the Hive table metadata to indicate when the last write was done to the table.
Writing data into the Data Warehouse with versioning to distinguish old vs. new data.
We add the epoch_timestamp of the write time to indicate the freshness of the data (a sketch of this path construction follows).
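To make the versioning scheme concrete, here is a minimal sketch of how a batch_id path along these lines could be assembled; the layout matches the paths above, but the helper name and bucket value are hypothetical.

```scala
import java.time.Instant

// Sketch: build the versioned S3 prefix for a write, using the epoch timestamp
// of the write as the batch_id. Partitioned tables get the partition spec first.
def versionedPath(bucket: String,
                  db: String,
                  table: String,
                  partitionSpec: Option[String]): String = {
  val batchId = Instant.now().getEpochSecond           // e.g. 1622073600
  val base    = s"s3://$bucket/$db/$table"
  partitionSpec match {
    case Some(spec) => s"$base/$spec/batch_id=$batchId" // e.g. date_column=20210527/batch_id=...
    case None       => s"$base/batch_id=$batchId"
  }
}

// versionedPath("warehouse-bucket", "hive_db_name", "hive_table_name", Some("date_column=20210527"))
```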
Since we have a single path to validate and write to the Data Warehouse, what other common functionality could we add to provide more value to our Data Scientists?
Spark Writer Modules
Config-driven transformations while writing data to the Data Warehouse
Spark Writer Modules - Adding modules
Adding them as transformations in the writer library was straightforward. In addition, we had to:
• Make each module configurable via Spark properties
• Make each module behave the same for every write pipeline
• Make each module configurable to either block the write or not in the event of failure (a sketch of this gating follows the list)
• Add documentation for each module to help steer Data Scientists
How do we add additional functionality to the writing pipeline behind the scenes?
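As a rough illustration of how a module could be gated by Spark properties, here is a minimal sketch; the property names (spark.writer.<module>.enabled / .blocking) are made up for this example and are not the real Stitch Fix configuration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.{Failure, Success, Try}

// Sketch: run a writer module only if its Spark property is set, and decide from
// a second property whether a failure in the module should block the write.
def runModule(spark: SparkSession,
              name: String,
              df: DataFrame)(module: DataFrame => DataFrame): DataFrame = {
  val enabled  = spark.conf.get(s"spark.writer.$name.enabled", "false").toBoolean  // hypothetical property
  val blocking = spark.conf.get(s"spark.writer.$name.blocking", "true").toBoolean  // hypothetical property

  if (!enabled) df
  else Try(module(df)) match {
    case Success(out)           => out
    case Failure(e) if blocking => throw e // block the write on failure
    case Failure(_)             => df      // non-blocking mode: continue with the untouched DataFrame
  }
}
```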
Spark Writer Modules - 3 Modules
• Journalizer
• Data Cleanser
• Data Quality Checker
The 3 modules we built
Journalizer
Journalizing - Data can change
Example: Data about a client has the potential to change, and we need to capture it.
Note: These are Slowly Changing Dimensions (Type 2) - where we preserve the old values.
Current on date 2021-05-21: client_id=10, favorite_color=blue, dress_style=formal
Current on date 2021-05-22: client_id=10, favorite_color=black, dress_style=formal
Current on date 2021-05-23: client_id=10, favorite_color=purple, dress_style=formal
Current on date 2021-07-23: client_id=10, favorite_color=green, dress_style=formal
Journalizing - 2 ways of capturing historical information
2 types of Hive Tables to store this information.
History Tables
▪ Record of all data - written daily and partitioned by date
▪ Contains all records - duplicated across partitions
▪ Difficult to find nuanced information or track changes in data by date, since all the data is included
▪ Harder to access the data because of the size of the table
Journal Tables
▪ Compressed, de-duped information
▪ Two partitions: is_current = 1 (latest data) & is_current = 0 (old data)
▪ Tracks changing values by timestamp, e.g. sets a start and end date on a value to show its duration of validity
▪ Sorted for easy access by primary key
History Table (partitioned by date)
client_id | favorite_color | dress_style | date
10        | blue           | formal      | 2021-05-20
10        | blue           | formal      | 2021-05-21
10        | black          | formal      | 2021-05-21
10        | blue           | formal      | 2021-05-22
10        | black          | formal      | 2021-05-22
10        | purple         | formal      | 2021-05-22
…         | …              | …           | …
10        | blue           | formal      | 2021-07-23
10        | black          | formal      | 2021-07-23
10        | purple         | formal      | 2021-07-23
10        | green          | formal      | 2021-07-23

Journal Table (partitioned by is_current)
client_id | favorite_color | start_date                  | end_date                 | is_current
10        | blue           | 2021-01-01 (first recorded) | 2021-05-20               | 0
10        | black          | 2021-05-21                  | 2021-05-21               | 0
10        | purple         | 2021-05-22                  | 2021-07-22               | 0
10        | green          | 2021-07-23                  | 2999-01-01 (default end) | 1

Note: Tracking changes to favorite_color across time.
Given the compressed nature of Journal tables, we moved historical data into them.
A Journal table is meant to be a ledger of the change in values and a pointer to the current values.
Let’s now look at how Journal tables are created.
Journalizing - How do we create a journal table?
Some questions we asked ourselves:
1. How could we get easy access to the latest information about a particular key?
2. How can information be compressed and de-duplicated?
3. Can we determine how long favorite_color was set to <value>?
4. How do we update the table each time to maintain this ordering?
5. Where and when do we run this process of conversion?
What do we need to get to this table structure?
Input (daily history):
client_id | favorite_color | date
10        | blue           | 2021-05-20
10        | blue           | 2021-05-21
10        | blue           | 2021-05-22
10        | purple         | 2021-05-23

Compression / De-dupe produces:
client_id | favorite_color | start_date                  | end_date
10        | blue           | 2021-01-01 (first recorded) | 2021-05-22
10        | purple         | 2021-05-23                  | 2999-01-01 (default end)

start_date: when the value became valid.
end_date: when the value stopped being valid; the default end date marks the latest value without a specified end.
Before a change, the history data and the journal’s current-pointer partition look like this:

client_id | favorite_color | date
10        | blue           | 2021-05-20
10        | blue           | 2021-05-21

client_id | favorite_color | start_date                  | end_date                 | is_current
10        | blue           | 2021-01-01 (first recorded) | 2999-01-01 (default end) | 1

In a history table, we don’t know the changed value since it’s not marked.

After purple arrives on 2021-05-22:

client_id | favorite_color | date
10        | blue           | 2021-05-20
10        | blue           | 2021-05-21
10        | blue           | 2021-05-22
10        | purple         | 2021-05-22

client_id | favorite_color | start_date                  | end_date                 | is_current
10        | blue           | 2021-01-01 (first recorded) | 2021-05-21               | 0
10        | purple         | 2021-05-22                  | 2999-01-01 (default end) | 1

purple is now marked as the current value, and blue is moved to the older partition.
Journalizing - Process of Journalizing
1. User creates a Journal table and sets a field to track using metadata (e.g. client_id is set as the primary key).
2. When data is written to this table, the table is reloaded in its entirety and we perform:
a. Deduplication and compression
b. Setting the current values in partitions - if there are changes
c. Sorting the table based on the date
3. Rewrite this new DataFrame into the table (a minimal sketch of the compression step follows this list).
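A minimal sketch of what the deduplication and compression step could look like with Spark window functions, assuming the tracked key is client_id, the tracked field is favorite_color, and changes arrive in a DateType column named date. This illustrates the Type 2 pattern described above, not the actual Stitch Fix implementation.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Sketch: compress a daily history of (client_id, favorite_color, date) rows into
// journal rows carrying a validity range and an is_current flag (SCD Type 2 style).
def journalize(history: DataFrame): DataFrame = {
  val byDate = Window.partitionBy("client_id").orderBy("date")

  // Keep only the rows where the tracked value actually changed for that client.
  val changes = history
    .withColumn("prev_color", lag("favorite_color", 1).over(byDate))
    .filter(col("prev_color").isNull || col("prev_color") =!= col("favorite_color"))
    .withColumnRenamed("date", "start_date")

  val byStart = Window.partitionBy("client_id").orderBy("start_date")

  changes
    .withColumn("next_start", lead("start_date", 1).over(byStart))
    .withColumn("end_date",
      when(col("next_start").isNull, to_date(lit("2999-01-01"))) // default end time for the current value
        .otherwise(date_sub(col("next_start"), 1)))
    .withColumn("is_current", when(col("next_start").isNull, 1).otherwise(0))
    .drop("prev_color", "next_start")
}
```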
Journalizing - The workflow
Journalizing - Journal Table Pros & Cons
Pros
▪ De-duped data
▪ Two partitions for easy querying - is_current = 1 (latest data) & is_current = 0 (old data). A data pipeline needs to access only one partition for all the latest values.
▪ Compressed, with timestamps indicating a field value’s lifespan to track changes
▪ Sorted for easy access by primary key
Cons
▪ Complicated process with multiple steps prior to writing
▪ Rewriting the table is a must to maintain the rules of compression and deduplication
Data Cleanser
Data Cleanser - What and why?
Data can be old, un-referenced, or meant to be excluded.
• How do we make sure some record values don’t continue to persist in a table?
• How do we delete records or nullify them consistently throughout the warehouse?
• Can this be configured by the Data Scientists to apply to their table?
Can we cleanse data based on a configuration?
Data Cleanser - What does cleansing mean?
Let’s say we wish to nullify or delete some column values in a table (a sketch of both treatments on a Spark DataFrame follows the example).

Original:
id | column_a | column_b         | color | style
9  | value_a  | “string_field_1” | blue  | formal
10 | value_a1 | “string_field_2” | red   | casual
11 | value_a2 | “string_field_3” | white | formal

Nullified:
id | column_a | column_b | color | style
9  | null     | null     | blue  | formal
10 | null     | null     | red   | casual
11 | null     | null     | white | formal

Deleted:
id | column_a | column_b | color | style
9  | <empty>  | <empty>  | blue  | formal
10 | <empty>  | <empty>  | red   | casual
11 | <empty>  | <empty>  | white | formal
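A minimal sketch of the two treatments on a Spark DataFrame, using the column names from the example above. Delete is sketched here as dropping the rows whose key is flagged for removal; depending on the table's configuration it could equally mean blanking the values, as the <empty> depiction suggests.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Nullify: keep the rows, but blank out the configured columns (preserving their types).
def nullifyColumns(df: DataFrame, columns: Seq[String]): DataFrame =
  columns.foldLeft(df) { (acc, c) =>
    acc.withColumn(c, lit(null).cast(acc.schema(c).dataType))
  }

// Delete: drop the rows whose key value appears in the set of records to remove.
def deleteRows(df: DataFrame, keyColumn: String, keysToRemove: Seq[Any]): DataFrame =
  df.filter(!col(keyColumn).isin(keysToRemove: _*))
```

For example, nullifyColumns(df, Seq("column_a", "column_b")) would produce the "Nullified" table above.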
Data Cleanser - Criteria
1. Has to be configurable
2. Users should be able to specify the key to be monitored and the columns for cleansing
3. At least two treatments should be available:
a. nullify
b. delete
4. Should happen to data at write time and/or at rest
What does the cleanser have to do?
Data Cleanser - How?
• How?
• Perform cleansing at write time, to ensure all future records are cleansed even if the source included the values.
• Separately, cleanse the entire Hive table if the data is not used - to make sure older partitions don’t retain the un-referenced data.
• What do we need?
• A mechanism to configure what to cleanse - nullify/delete per table
• This mechanism needs to be accessible at write time and at rest to run the cleansing on the data.
How do we cleanse data?
Data Cleanser - Implementation
Table Configuration
▪ We have a metadata infrastructure that allows users to add metadata to their owned tables.
▪ Hive tables have metadata fields that can be used to store auxiliary information about them.
▪ The cleanser could simply access the table’s metadata and perform cleansing accordingly.
Cleansing
▪ Each table could have a configuration naming the columns to be cleansed, like [column_a, column_b], along with the treatment.
▪ Reacting to the specified metadata meant the cleanser module could work as configured at all times.
▪ The same module could perform cleansing for data while writing and/or at rest.
Data Cleanser - The workflow
1. User specifies the metadata configuration for cleansing in a Hive table (a sketch of applying such a configuration follows this slide):
metadata = {"key": "id", "treatment": "nullify", "columns": ["column_a", "column_b"]}
2. Cleanser reads the table and checks all the columns that match.
3. Performs nullify/delete on the DataFrame and either proceeds to the next transformation or writes this cleansed DataFrame to the Data Warehouse.
How does it come together?
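A minimal sketch of how such a configuration could drive the cleansing step. The CleansingConfig case class and the keysToRemove argument are assumptions for illustration; in practice the configuration would be read from the table's metadata, and the keys to remove would be derived from the configured key.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical shape of the cleansing configuration stored in the table metadata.
case class CleansingConfig(key: String, treatment: String, columns: Seq[String])

// Sketch: apply the configured treatment to the DataFrame before it is written.
def cleanse(df: DataFrame, config: CleansingConfig, keysToRemove: Seq[Any]): DataFrame =
  config.treatment match {
    case "nullify" =>
      config.columns.foldLeft(df) { (acc, c) =>
        acc.withColumn(c, lit(null).cast(acc.schema(c).dataType))
      }
    case "delete" =>
      df.filter(!col(config.key).isin(keysToRemove: _*))
    case other =>
      throw new IllegalArgumentException(s"Unknown cleansing treatment: $other")
  }
```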
Data Cleanser - The workflow
Data Quality Checker
Data Quality - Background
• How do we detect errors or skews in data?
• When do we check for data problems?
• How do Data Scientists set up Data Quality checks?
What motivated the data quality initiative?
Data Quality - What do we need to check data?
• A service to initialize and run tests on Hive tables
• A mechanism that calculates metrics, based on the configured tests, on the data prior to writing it to the warehouse
• An interface that allows users to autonomously set up Data Quality and run tests on their pipelines
What components were needed for running data quality checks?
Data Quality - What would a Test look like?
• NullCount(column_name)
• Is the null count on this column higher than “value”?
• Average(column_name)
• Is the average below what is expected?
• Max(column_name)
• Is the max value for this column exceeding a certain limit?
• RowCount(table)
• Are we suddenly writing more rows than anticipated?
Some examples of tests that we started off with.
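As an illustration, metrics like these could be computed in a single Spark aggregation pass over the incoming DataFrame before the thresholds are checked; the function below is a sketch under that assumption (a numeric column, non-empty data), not the actual metric module.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch: compute the example metrics for one column in a single aggregation pass.
// Threshold comparisons (the actual "tests") are assumed to happen elsewhere.
def computeMetrics(df: DataFrame, columnName: String): Map[String, Any] = {
  val row = df.agg(
    count(when(col(columnName).isNull, 1)).as("null_count"), // NullCount(column_name)
    avg(col(columnName)).as("average"),                      // Average(column_name)
    max(col(columnName)).as("max"),                          // Max(column_name)
    count(lit(1)).as("row_count")                            // RowCount(table)
  ).head()

  row.getValuesMap[Any](Seq("null_count", "average", "max", "row_count"))
}
```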
Data Quality - How we built it
• Built a service that:
• Enabled CRUD operations on tests for Hive tables
• Had the ability to run tests on metrics when triggered
• At the same time, we built the ability to calculate metrics into a module in the Spark writer library:
• This module interacted with the data quality service to find the metrics that needed to be calculated
• Ran these calculations in Spark on the input DataFrame - e.g. average(column_name)
• Triggered tests on these metrics and posted the results to the user
Putting the components together
Data Quality - Surfacing Data Quality to users
1. The data quality service had a Python client that helped users run CRUD operations on tests.
2. The writer module could be configured to run on a write operation for a table.
a. Setting spark.enable.data.quality.checks=true in the Spark properties ran these tests at write time (see the example below).
3. Separately, we created an offline mode to run tests on already-written data, if the user doesn’t wish to block writes to the table.
What did the interface look like?
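For instance, the write-time checks could be switched on per job with the property quoted above; how it is supplied depends on how the job is launched, so the snippet below is just one possible way.

```scala
import org.apache.spark.sql.SparkSession

// Enable the data quality checks for this writer job via a Spark property.
val spark = SparkSession.builder()
  .appName("etl-write-with-dq-checks")
  .config("spark.enable.data.quality.checks", "true")
  .getOrCreate()
```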
Spark Writer Modules - Transformations in code
def writeDataFrame(inputDataframe: DataFrame,
                   databaseName: String,
                   tableName: String) = {
  // Validation
  val validatedDataframe = sfWriter.validateDataframe(inputDataframe, databaseName, tableName)
  // Journalizing
  val journalizedDataframe = sfWriter.journalizeDataframe(validatedDataframe, databaseName, tableName)
  // Data Cleanser
  val cleansedDataframe = sfWriter.dataCleanser(journalizedDataframe, databaseName, tableName)
  // Data Quality Checker
  sfWriter.dataQualityChecker(cleansedDataframe, databaseName, tableName)
  // Write to the Data Warehouse + Update Metastore
  sfWriter.writeToS3(cleansedDataframe, databaseName, tableName)
}
Learnings & Future Work
What we learned and where we are headed
Learnings & Future Work - Lessons learned
• Adding new modules meant more complexity in the write pipeline, but each step performed a valuable transformation.
• Making each transformation performant and efficient was a top priority when each module was being created.
• Testing - unit & integration - was key to rolling out without mishaps.
• Introducing these modules to Data Scientists meant we needed better communication and more documentation.
• Getting data quality checks to run efficiently was a challenge, since we had to programmatically calculate the partitions of the DataFrame and run tests against each potential Hive partition. This took some effort to run smoothly.
By adding modularized transformations to data, what changed and how did we adapt?
Learnings & Future Work - Future Work
Now, additional modules can easily be added in a similar fashion.
• Data Quality is being enhanced with support for customized testing rather than simple thresholds or values.
• The goal is to have Data Quality ingrained in the ETL process of our Data Science workflows.
• The Journalizer and Data Cleanser are mostly static, but we are exploring alternate solutions to help augment and delete records more efficiently.
By adding modularized transformations to data, what changed and how did we adapt?
Summary
TL;DR:
Summary
Writing data with Spark @ Stitch Fix:
• We have a singular, Spark-driven write path for getting data into the warehouse.
• 3 config-driven modules perform transformations at write time:
• Journalizing: writing a non-duplicated historical record of data for quick access and compression.
• Data Cleanser: deleting or nullifying values based on table configuration.
• Data Quality: calculating metrics and running tests on data coming into the warehouse.
Thank you.
Questions?
