Zalando's AI-driven products and its distributed landscape of analytical data marts cannot wait for long-running, hard-to-recover, monolithic batch jobs that take all night to calculate already-outdated data. Modern data integration pipelines need to deliver fast, easy-to-consume, high-quality data sets. Based on Spark Streaming and Delta, the central data warehousing team was able to deliver widely used master data as S3 or Kafka streams and snapshots at the same time.
The talk will cover challenges in our fashion data platform and a detailed architectural deep dive into separating integration from enrichment, providing streams as well as snapshots, and feeding the data to distributed data marts. Finally, lessons learned and best practices about Delta's MERGE command, the Scala API vs. Spark SQL, and schema evolution give more insights and guidance for similar use cases.
2. Sebastian Herold, Zalando SE
Data Warehousing with
Spark Streaming @
Zalando
#UnifiedDataAnalytics #SparkAISummit
3. 3
# Principal Data Engineer / Architect
# 7y @ Immo-/Scout24
# DataDevOps Manifesto
# Data Platform
# 2y @ Zalando
# ML Productivity
# Streaming DWH
@heroldamus
Data Warehousing with Spark Streaming
Sebastian Herold
4. 4
WE BRING FASHION TO PEOPLE
2008-2009
2010
2012-2013
2011
2018
17 markets
9 fulfillment centers
>28M active customers
5.4B revenue 2018
>300M visits/month
>14k employees
>400k product choices
>80% visits from mobile
5. 5
TECH@SCALE
Data Warehousing with Spark Streaming - Spark + AI Summit Amsterdam ‘19
>350 accounts
>100 clusters
>250 teams
>5 data lakes
API
>800 micro services
9. DRAWBACKS OF CENTRAL DWH
LOWER LATENCY REQUIRED BY
AI USE-CASES,
OTHER DATA WAREHOUSES,
NEAR-REALTIME USE-CASES
10. DRAWBACKS OF CENTRAL DWH
MULTIPLE TEAMS DO THE SAME
LOW-LATENCY EVENT INTEGRATION
11. HEAVY INTEGRATION OF UNSTRUCTURED DATA INTO RELATIONAL TABLES
DATASETS ARE NEEDED
DISTRIBUTED
LOWER LATENCY REQUIRED BY AI USE-CASES, OTHER DATA WAREHOUSES, NEAR-REALTIME USE-CASES
MULTIPLE TEAMS DO THE SAME LOW-LATENCY EVENT INTEGRATION
STREAMING
12. 12
SALES ORDER EXAMPLE
order.created
order_id,
order_date,
items,
...
shipment.created
order_id,
shipping_date,
shipped_items,
...
payment.done
payment_id,
payment_date,
order_id,
...
item.returned
order_id,
return_date,
returned_item,
...
sales-order
order_id,
order_date,
payment_id,
payment_date,
items:
shipped_at,
returned_at,
...
calculated_1,
calculated_2
...
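The enrichment above can be sketched as a pure fold over the four event types. All class names, fields, and the folding logic below are illustrative simplifications of the schemas shown on the slide, not Zalando's actual code:

```scala
import java.time.LocalDate

// Simplified, hypothetical shapes of the four source event types
sealed trait Event
case class OrderCreated(orderId: String, orderDate: LocalDate, items: Seq[String]) extends Event
case class ShipmentCreated(orderId: String, shippingDate: LocalDate, shippedItems: Seq[String]) extends Event
case class PaymentDone(paymentId: String, paymentDate: LocalDate, orderId: String) extends Event
case class ItemReturned(orderId: String, returnDate: LocalDate, returnedItem: String) extends Event

// The enriched sales-order entity assembled from the events
case class Item(sku: String, shippedAt: Option[LocalDate] = None, returnedAt: Option[LocalDate] = None)
case class SalesOrder(orderId: String, orderDate: LocalDate,
                      paymentId: Option[String] = None, paymentDate: Option[LocalDate] = None,
                      items: Seq[Item] = Nil)

// Apply one event to the current state of the sales order
def applyEvent(order: SalesOrder, event: Event): SalesOrder = event match {
  case ShipmentCreated(_, date, shipped) =>
    order.copy(items = order.items.map(i =>
      if (shipped.contains(i.sku)) i.copy(shippedAt = Some(date)) else i))
  case PaymentDone(pid, date, _) =>
    order.copy(paymentId = Some(pid), paymentDate = Some(date))
  case ItemReturned(_, date, sku) =>
    order.copy(items = order.items.map(i =>
      if (i.sku == sku) i.copy(returnedAt = Some(date)) else i))
  case _: OrderCreated => order // the initial state comes from the fold's seed
}

// Fold the remaining events onto the order.created seed
def buildSalesOrder(created: OrderCreated, events: Seq[Event]): SalesOrder =
  events.foldLeft(SalesOrder(created.orderId, created.orderDate,
    items = created.items.map(Item(_))))(applyEvent)
```

Keeping this logic as plain functions over case classes, rather than inlined SQL expressions, is what makes the enrichment unit-testable later in the talk.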
13. 13
HOW WE STARTED
Topics
Streaming
S3
nakadi.io
S3 Delta Table
WAIT!
Downstream
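A minimal sketch of this initial pipeline: consume events and append them to a Delta table on S3. Nakadi has no standard Spark connector, so a Kafka-compatible source stands in for it here; the brokers, topic, schema, and S3 paths are all assumptions, not the talk's actual configuration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

val spark = SparkSession.builder().appName("order-integration").getOrCreate()

// Hypothetical schema for order.created payloads
val orderSchema = new StructType()
  .add("order_id", StringType)
  .add("order_date", TimestampType)

spark.readStream
  .format("kafka")                                        // stand-in for the Nakadi source
  .option("kafka.bootstrap.servers", "broker:9092")       // placeholder
  .option("subscribe", "order.created")
  .load()
  .select(from_json(col("value").cast("string"), orderSchema).as("event"))
  .select("event.*")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "s3://bucket/checkpoints/order_created") // placeholder path
  .outputMode("append")
  .start("s3://bucket/delta/order_created")               // placeholder path
```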
14. 14
INTEGRATION OF HISTORIC DATA
Topics
Streaming
S3
nakadi.io
S3 Delta Table
Central DWH Bootstrap
Delta Table
BOOM!
Batch time increased to 2h!
MERGE command slow for needles in the haystack
Downstream
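One common mitigation for the "needles in the haystack" MERGE problem (a sketch of the general technique, not necessarily the fix used in the talk): include the partition column in the merge condition so Delta can prune untouched files instead of scanning the whole table. The table paths and column names below are hypothetical:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sales-order-merge").getOrCreate()

// Hypothetical source of updated sales-order rows
val updates = spark.read.format("delta").load("s3://bucket/delta/sales_order_changes")

DeltaTable.forPath(spark, "s3://bucket/delta/sales_order").as("t")
  .merge(
    updates.as("u"),
    // order_month is the (assumed) partition column: constraining it in the
    // merge condition lets Delta skip partitions instead of rewriting everything
    "t.order_month = u.order_month AND t.order_id = u.order_id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
```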
15. 15
INTRODUCE SNAPSHOTS AND CHANGES TABLE
Topics
Streaming
S3
nakadi.io
S3
Central DWH Bootstrap
Delta Table
Downstream
Snapshot Changes
Snapshotter
Better, but still slow!
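The snapshot-plus-changes pattern can be sketched in two parts: the stream only appends to a cheap changes table, and a periodic snapshotter batch job collapses snapshot and changes into a fresh snapshot by keeping the latest row per key. Paths and column names are assumed:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder().appName("snapshotter").getOrCreate()

// Assumed layout: both tables share a key (order_id) and an update timestamp
val snapshot = spark.read.format("delta").load("s3://bucket/delta/sales_order/snapshot")
val changes  = spark.read.format("delta").load("s3://bucket/delta/sales_order/changes")

// Rank rows per order so the newest version comes first
val latestFirst = Window.partitionBy("order_id").orderBy(col("updated_at").desc)

snapshot.unionByName(changes)
  .withColumn("rn", row_number().over(latestFirst))
  .filter(col("rn") === 1)   // keep only the latest state of each order
  .drop("rn")
  .write.format("delta").mode("overwrite")
  .save("s3://bucket/delta/sales_order/snapshot_new") // consumers switch to the new snapshot
```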
18. 18
SQL vs SCALA
# Started with 200 lines of SQL
# Grew fast to 400 lines
# Violated DRY principle
# Hard to unit-test
# Hard to refactor
# Bad support for nested structures
SCALA
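The DRY and testability points can be illustrated with plain Scala: an expression that a long SQL statement would repeat becomes one named, unit-testable function. The business rule below (a 100-day return window) is invented for illustration:

```scala
import java.time.LocalDate

// Simplified item as it appears in the enriched sales order
case class OrderItem(sku: String, shippedAt: Option[LocalDate], returnedAt: Option[LocalDate])

// One definition, reused wherever the SQL previously duplicated the expression.
// The 100-day window is an example rule, not Zalando's actual policy.
def isOpenReturnWindow(item: OrderItem, asOf: LocalDate, windowDays: Long = 100L): Boolean =
  item.shippedAt.exists { shipped =>
    item.returnedAt.isEmpty && !asOf.isAfter(shipped.plusDays(windowDays))
  }
```

Unlike a fragment buried in 400 lines of SQL, this function can be refactored safely and covered by ordinary unit tests.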
19. 19
LESSONS LEARNED
# Streaming needs different thinking
# DWH ~ Backend Programming
# Don’t start with SQL because it’s easy
# Databricks Delta succeeds Parquet
# Make sure all data is available in S3
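On schema evolution (mentioned in the abstract): Delta can evolve a table's schema on write instead of failing when new event fields appear. A sketch with assumed paths:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema-evolution").getOrCreate()

// New events may carry additional fields; mergeSchema tells Delta to add the
// new columns to the table instead of rejecting the write.
val newEvents = spark.read.json("s3://bucket/raw/order_created/2019-10-16") // hypothetical path

newEvents.write
  .format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .save("s3://bucket/delta/order_created") // hypothetical path
```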