Zalando's AI-driven products and its distributed landscape of analytical data marts cannot wait for long-running, hard-to-recover, monolithic batch jobs that take all night to calculate already-outdated data. Modern data integration pipelines need to deliver fast, easy-to-consume, high-quality data sets. Based on Spark Streaming and Delta, the central data warehousing team was able to deliver widely used master data as S3 or Kafka streams and snapshots at the same time.
The talk will cover challenges in our fashion data platform and a detailed architectural deep dive into separating integration from enrichment, providing streams as well as snapshots, and feeding the data to distributed data marts. Finally, lessons learned and best practices about Delta's MERGE command, the Scala API vs. Spark SQL, and schema evolution give more insights and guidance for similar use cases.
2. Sebastian Herold, Zalando SE
Data Warehousing with
Spark Streaming @
Zalando
#UnifiedDataAnalytics #SparkAISummit
3. 3
# Principal Data Engineer / Architect
# 7y @ Immo-/Scout24
# DataDevOps Manifesto
# Data Platform
# 2y @ Zalando
# ML Productivity
# Streaming DWH
@heroldamus
Data Warehousing with Spark Streaming
Sebastian Herold
4. 4
WE BRING FASHION TO PEOPLE
2008-2009
2010
2012-2013
2011
2018
17 markets
9 fulfillment centers
>28M active customers
5.4B revenue 2018
>300M visits/month
>14k employees
>400k product choices
>80% visits from mobile
5. 5
TECH@SCALE
Data Warehousing with Spark Streaming - Spark + AI Summit Amsterdam ‘19
>350 accounts
>100 clusters
>250 teams
>5 data lakes
API
>800 micro services
9. DRAWBACKS OF CENTRAL DWH
LOWER LATENCY REQUIRED BY
AI USE-CASES,
OTHER DATA WAREHOUSES,
NEAR-REALTIME USE-CASES
10. DRAWBACKS OF CENTRAL DWH
MULTIPLE TEAMS DO THE SAME
LOW-LATENCY EVENT INTEGRATION
11. HEAVY INTEGRATION OF UNSTRUCTURED DATA INTO RELATIONAL TABLES
DATASETS ARE NEEDED
DISTRIBUTED
LOWER LATENCY REQUIRED BY AI USE-CASES, OTHER DATA WAREHOUSES, NEAR-REALTIME USE-CASES
MULTIPLE TEAMS DO THE SAME LOW-LATENCY EVENT INTEGRATION
STREAMING
12. 12
SALES ORDER EXAMPLE
order.created
order_id,
order_date,
items,
...
shipment.created
order_id,
shipping_date,
shipped_items,
...
payment.done
payment_id,
payment_date,
order_id,
...
item.returned
order_id,
return_date,
returned_item,
...
sales-order
order_id,
order_date,
payment_id,
payment_date,
items:
shipped_at,
returned_at,
...
calculated_1,
calculated_2
...
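The enrichment above can be sketched as a pure fold over the four event types. All class names, fields, and the folding logic below are illustrative simplifications of the schemas shown on the slide, not Zalando's actual code:

```scala
import java.time.LocalDate

// Simplified, hypothetical shapes of the four source event types
sealed trait Event
case class OrderCreated(orderId: String, orderDate: LocalDate, items: Seq[String]) extends Event
case class ShipmentCreated(orderId: String, shippingDate: LocalDate, shippedItems: Seq[String]) extends Event
case class PaymentDone(paymentId: String, paymentDate: LocalDate, orderId: String) extends Event
case class ItemReturned(orderId: String, returnDate: LocalDate, returnedItem: String) extends Event

// The enriched sales-order entity assembled from the events
case class Item(sku: String, shippedAt: Option[LocalDate] = None, returnedAt: Option[LocalDate] = None)
case class SalesOrder(orderId: String, orderDate: LocalDate,
                      paymentId: Option[String] = None, paymentDate: Option[LocalDate] = None,
                      items: Seq[Item] = Nil)

// Apply one event to the current state of the sales order
def applyEvent(order: SalesOrder, event: Event): SalesOrder = event match {
  case ShipmentCreated(_, date, shipped) =>
    order.copy(items = order.items.map(i =>
      if (shipped.contains(i.sku)) i.copy(shippedAt = Some(date)) else i))
  case PaymentDone(pid, date, _) =>
    order.copy(paymentId = Some(pid), paymentDate = Some(date))
  case ItemReturned(_, date, sku) =>
    order.copy(items = order.items.map(i =>
      if (i.sku == sku) i.copy(returnedAt = Some(date)) else i))
  case _: OrderCreated => order // the initial state comes from the fold's seed
}

// Fold the remaining events onto the order.created seed
def buildSalesOrder(created: OrderCreated, events: Seq[Event]): SalesOrder =
  events.foldLeft(SalesOrder(created.orderId, created.orderDate,
    items = created.items.map(Item(_))))(applyEvent)
```

Keeping this logic as plain functions over case classes, rather than inlined SQL expressions, is what makes the enrichment unit-testable later in the talk.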
13. 13
HOW WE STARTED
Topics
Streaming
S3
nakadi.io
S3 Delta Table
WAIT!
Downstream
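A minimal sketch of this initial pipeline: consume events and append them to a Delta table on S3. Nakadi has no standard Spark connector, so a Kafka-compatible source stands in for it here; the brokers, topic, schema, and S3 paths are all assumptions, not the talk's actual configuration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

val spark = SparkSession.builder().appName("order-integration").getOrCreate()

// Hypothetical schema for order.created payloads
val orderSchema = new StructType()
  .add("order_id", StringType)
  .add("order_date", TimestampType)

spark.readStream
  .format("kafka")                                        // stand-in for the Nakadi source
  .option("kafka.bootstrap.servers", "broker:9092")       // placeholder
  .option("subscribe", "order.created")
  .load()
  .select(from_json(col("value").cast("string"), orderSchema).as("event"))
  .select("event.*")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "s3://bucket/checkpoints/order_created") // placeholder path
  .outputMode("append")
  .start("s3://bucket/delta/order_created")               // placeholder path
```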
14. 14
INTEGRATION OF HISTORIC DATA
Topics
Streaming
S3
nakadi.io
S3 Delta Table
Central DWH Bootstrap
Delta Table
BOOM!
Batch time increased to 2h!
MERGE command slow for needles in the haystack
Downstream
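One common mitigation for the "needles in the haystack" MERGE problem (a sketch of the general technique, not necessarily the fix used in the talk): include the partition column in the merge condition so Delta can prune untouched files instead of scanning the whole table. The table paths and column names below are hypothetical:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sales-order-merge").getOrCreate()

// Hypothetical source of updated sales-order rows
val updates = spark.read.format("delta").load("s3://bucket/delta/sales_order_changes")

DeltaTable.forPath(spark, "s3://bucket/delta/sales_order").as("t")
  .merge(
    updates.as("u"),
    // order_month is the (assumed) partition column: constraining it in the
    // merge condition lets Delta skip partitions instead of rewriting everything
    "t.order_month = u.order_month AND t.order_id = u.order_id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
```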
15. 15
INTRODUCE SNAPSHOTS AND CHANGES TABLE
Topics
Streaming
S3
nakadi.io
S3
Central DWH Bootstrap
Delta Table
Downstream
Snapshot Changes
Snapshotter
Better, but still slow!
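The snapshot-plus-changes pattern can be sketched in two parts: the stream only appends to a cheap changes table, and a periodic snapshotter batch job collapses snapshot and changes into a fresh snapshot by keeping the latest row per key. Paths and column names are assumed:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder().appName("snapshotter").getOrCreate()

// Assumed layout: both tables share a key (order_id) and an update timestamp
val snapshot = spark.read.format("delta").load("s3://bucket/delta/sales_order/snapshot")
val changes  = spark.read.format("delta").load("s3://bucket/delta/sales_order/changes")

// Rank rows per order so the newest version comes first
val latestFirst = Window.partitionBy("order_id").orderBy(col("updated_at").desc)

snapshot.unionByName(changes)
  .withColumn("rn", row_number().over(latestFirst))
  .filter(col("rn") === 1)   // keep only the latest state of each order
  .drop("rn")
  .write.format("delta").mode("overwrite")
  .save("s3://bucket/delta/sales_order/snapshot_new") // consumers switch to the new snapshot
```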
18. 18
SQL vs SCALA
# Started with 200 lines of SQL
# Grew fast to 400 lines
# Violated DRY principle
# Hard to unit-test
# Hard to refactor
# Bad support for nested structures
SCALA
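The DRY and testability points can be illustrated with plain Scala: an expression that a long SQL statement would repeat becomes one named, unit-testable function. The business rule below (a 100-day return window) is invented for illustration:

```scala
import java.time.LocalDate

// Simplified item as it appears in the enriched sales order
case class OrderItem(sku: String, shippedAt: Option[LocalDate], returnedAt: Option[LocalDate])

// One definition, reused wherever the SQL previously duplicated the expression.
// The 100-day window is an example rule, not Zalando's actual policy.
def isOpenReturnWindow(item: OrderItem, asOf: LocalDate, windowDays: Long = 100L): Boolean =
  item.shippedAt.exists { shipped =>
    item.returnedAt.isEmpty && !asOf.isAfter(shipped.plusDays(windowDays))
  }
```

Unlike a fragment buried in 400 lines of SQL, this function can be refactored safely and covered by ordinary unit tests.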
19. 19
LESSONS LEARNED
# Streaming needs different thinking
# DWH ~ Backend Programming
# Don’t start with SQL because it’s easy
# Databricks Delta succeeds Parquet
# Make sure all data is available in S3
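On schema evolution (mentioned in the abstract): Delta can evolve a table's schema on write instead of failing when new event fields appear. A sketch with assumed paths:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema-evolution").getOrCreate()

// New events may carry additional fields; mergeSchema tells Delta to add the
// new columns to the table instead of rejecting the write.
val newEvents = spark.read.json("s3://bucket/raw/order_created/2019-10-16") // hypothetical path

newEvents.write
  .format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .save("s3://bucket/delta/order_created") // hypothetical path
```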