A talk presented by Max Schultze from Zalando and Arif Wider from ThoughtWorks at NDC Oslo 2020.
Abstract:
The Data Lake paradigm is often considered the scalable successor of the more curated Data Warehouse approach when it comes to democratization of data. However, many who went out to build a centralized Data Lake came out with a data swamp of unclear responsibilities, a lack of data ownership, and sub-par data availability.
At Zalando, Europe’s biggest online fashion retailer, we realised that accessibility and availability at scale can only be guaranteed when moving more responsibilities to those who pick up the data and have the respective domain knowledge, the data owners, while keeping only data governance and metadata information central. Such a decentralized and domain-focused approach has recently been coined a Data Mesh.
The Data Mesh paradigm promotes the concept of Data Products which go beyond sharing of files and towards guarantees of quality and acknowledgement of data ownership.
This talk will take you on a journey of how we went from a centralized Data Lake to embrace a distributed Data Mesh architecture and will outline the ongoing efforts to make creation of data products as simple as applying a template.
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes Beyond the Data Lake
1. Data Mesh in Practice
How Europe’s Leading Online Platform for Fashion Goes Beyond the Data Lake
Max Schultze - max.schultze@zalando.de - @mcs1408
Arif Wider - awider@thoughtworks.com - @arifwider
12-06-2020
2. Who are we?
Max Schultze
● Lead Data Engineer
● MSc in Computer Science
● Took part in early development of Apache Flink
● Retired semi-professional Magic: the Gathering player
Arif Wider
● Lead Technology Consultant
● Head of AI, ThoughtWorks Germany
● Scala & FP enthusiast
● Coffee geek
3. 7000+ technologists with 43 offices in 14 countries
Partner for technology driven business transformation
Barcelona - Madrid - London - Manchester - Berlin - Hamburg - Munich - Cologne
16. Centralization Challenges
Datasets provided by data-agnostic infrastructure team
● Lack of ownership
Pipeline responsibility on data-agnostic infrastructure team
● Lack of quality
Organizational scaling
● Central team becomes the bottleneck
17. A Recurring Pattern
Product teams generating data
Data engineers maintaining the data platform
Decision makers, data scientists consuming data
20. What is Data Mesh?
Old wine applied to new bottles…
→ Product Thinking
→ Domain-Driven Distributed Architecture
→ Infrastructure as a Platform
… creates value from Data
21. Data as a Product
Data Product
What is my market?
What are the desires of my customers?
What “price” is justified?
How to do marketing?
What’s the USP?
Are my customers happy?
24. Domain-Driven Distributed Data Architecture
Discoverable
Addressable
Self-describing
Trustworthy
Interoperable (governed by open standard)
Secure (governed by global access control)
Domain
→ The Data Product is the fundamental building block
Aggregated Domain
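The properties listed on this slide can be made concrete with a small sketch. The talk does not show an implementation, so everything below is an assumption for illustration: the `DataProduct` and `Catalog` classes, the `mesh://` address scheme, and the role names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor for one data product (not from the talk)."""
    domain: str                   # owning domain
    name: str                     # product name within the domain
    owner: str                    # accountable team (trustworthy)
    schema: dict                  # column name -> type (self-describing)
    data_format: str = "parquet"  # shared open format (interoperable)
    allowed_roles: frozenset = frozenset()  # checked centrally (secure)

    @property
    def address(self) -> str:
        # Stable unique address (addressable); the URI scheme is made up.
        return f"mesh://{self.domain}/{self.name}"


class Catalog:
    """Minimal in-memory registry so products can be found (discoverable)."""

    def __init__(self):
        self._products = {}

    def register(self, product):
        if product.address in self._products:
            raise ValueError(f"{product.address} is already registered")
        self._products[product.address] = product

    def search(self, term):
        # Match the term against the address and the column names.
        term = term.lower()
        return [
            p for p in self._products.values()
            if term in p.address.lower()
            or any(term in column.lower() for column in p.schema)
        ]

    def can_read(self, address, role):
        # Global access control: one central check for every product.
        return role in self._products[address].allowed_roles


catalog = Catalog()
orders = DataProduct(
    domain="checkout",
    name="orders",
    owner="team-checkout",
    schema={"order_id": "string", "total_eur": "decimal"},
    allowed_roles=frozenset({"analyst"}),
)
catalog.register(orders)
```

In this sketch the domain team owns the descriptor and the data, while the catalog and the access check are shared by all domains.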
25. Self-Service Data Infrastructure
Data Infra as a Platform
Storage, pipeline, catalogue, access control, etc.
Data infra engineers
26. Global Governance & Open Standards
Enable interoperability
An Ecosystem of Data Products
27. It’s a mindset shift
FROM → TO
Centralized ownership → Decentralized ownership
Pipelines as first class concern → Domain Data as first class concern
Data as a by-product → Data as a Product
Siloed Data Engineering Team → Cross-functional Domain-Data Teams
Centralized Data Lake / Warehouse → Ecosystem of Data Products
29. Data Mesh in Practice
Recap:
● From Bottleneck to Infra Platform
Data Infra as a Platform
Storage, pipeline, catalogue, access control, etc.
30. Data Mesh in Practice
Recap:
● From Bottleneck to Infra Platform
● From Data Monolith to Interoperable Services
Data Infra as a Platform
Storage, pipeline, catalogue, access control, etc.
central data platform
35. Central Services with Global Interoperability
Decentralized ownership does not imply decentralized infrastructure!
Interoperability is created through the convenient solutions of a self-service platform.
Decentralized Storage, Central Infrastructure
Decentralized Ownership, Central Governance
37. Data Mesh in Practice
Recap:
● Datasets provided through pipelines of data-agnostic infrastructure teams
Who is allowed to share data?
What are the criteria to enable data consumers?
How to ensure data quality?
38. How to Ensure Data Quality?
Make conscious decisions
● Opt-in instead of default storage
39. How to Ensure Data Quality?
Make conscious decisions
● Opt-in instead of default storage
● Classification of data usage
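One way to picture "opt-in instead of default storage" together with a usage classification is the sketch below. It is an illustration only: the `ArchiveRegistry` class and the three `UsageClass` values are hypothetical, and the talk does not list the real classes.

```python
from enum import Enum


class UsageClass(Enum):
    """Hypothetical usage classes; the real ones are not given in the talk."""
    EXPLORATION = "exploration"
    REPORTING = "reporting"
    PRODUCTION = "production"


class ArchiveRegistry:
    """Stores nothing by default; producers must opt in per dataset."""

    def __init__(self):
        self._opted_in = {}

    def opt_in(self, dataset, owner, usage):
        # Opting in is a conscious decision: it needs an owner and a usage class.
        if not isinstance(usage, UsageClass):
            raise TypeError("usage must be a UsageClass")
        self._opted_in[dataset] = (owner, usage)

    def should_store(self, dataset):
        # Anything that was never opted in is simply not archived.
        return dataset in self._opted_in

    def usage_of(self, dataset):
        return self._opted_in[dataset][1]


registry = ArchiveRegistry()
registry.opt_in("checkout.orders", "team-checkout", UsageClass.REPORTING)
```

Because every stored dataset has a named owner and a declared usage, questions about quality expectations always have someone to go to.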
40. Data Quality - A Contract between Consumer and Producer
Behavioral changes for data producers
● Data is a product, not a by-product
41. Data Quality - A Contract between Consumer and Producer
Behavioral changes for data producers
● Data is a product, not a by-product
● Dedicate resources to
○ Understand usage
○ Ensure quality
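A contract between consumer and producer can be expressed as a check the producer runs before publishing. The sketch below is a minimal illustration under assumed rules (required columns and a maximum share of missing values); `DataContract` and `violations` are hypothetical names, and the talk does not specify what its contracts contain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract agreed between a producer and its consumers."""
    required_columns: frozenset
    max_null_fraction: float  # allowed share of missing values per column


def violations(contract, rows):
    """Return readable contract violations for a batch of dict rows."""
    if not rows:
        return ["batch is empty"]
    problems = []
    for column in sorted(contract.required_columns):
        # A value counts as missing if the key is absent or the value is None.
        missing = sum(1 for row in rows if row.get(column) is None)
        fraction = missing / len(rows)
        if fraction > contract.max_null_fraction:
            problems.append(
                f"{column}: {fraction:.0%} missing exceeds "
                f"{contract.max_null_fraction:.0%}"
            )
    return problems


contract = DataContract(
    required_columns=frozenset({"order_id", "total_eur"}),
    max_null_fraction=0.1,
)
good_batch = [
    {"order_id": "a", "total_eur": 10.0},
    {"order_id": "b", "total_eur": 20.0},
]
bad_batch = [
    {"order_id": "a", "total_eur": None},
    {"order_id": "b", "total_eur": 20.0},
]
```

Running such a check on the producer side is one way of dedicating resources to quality: a batch that violates the contract is caught before consumers see it.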
43. Into the Future
● Domain Enterprise Architecture
○ Definition of domain responsibilities
○ Appointment of domain-specific experts
44. Into the Future
● Domain Enterprise Architecture
○ Definition of domain responsibilities
○ Appointment of domain-specific experts
● “Off the shelf” data products
○ Decentralized archiving
○ Template-driven data preparation
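The idea of creating a data product by applying a template could look roughly like the following sketch. The template fields, the output layout, and the archive naming are all assumptions made for illustration and do not describe Zalando's actual templates.

```python
from string import Template

# Hypothetical template: a domain team fills in a few values and
# receives a complete data product definition.
PRODUCT_TEMPLATE = Template(
    "product: $domain.$name\n"
    "owner: $owner\n"
    "source: $source\n"
    "archive: $archive_location\n"
)


def render_data_product(domain, name, owner, source):
    """Fill the template; the archive location follows an assumed convention."""
    return PRODUCT_TEMPLATE.substitute(
        domain=domain,
        name=name,
        owner=owner,
        source=source,
        # Each domain archives into its own location (decentralized archiving).
        archive_location=f"{domain}-archive/{name}/",
    )


definition = render_data_product(
    domain="checkout",
    name="orders",
    owner="team-checkout",
    source="orders-events",
)
```

The platform team would maintain the template, while each domain team supplies only the values that require domain knowledge.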
45. Data Mesh in Practice
How Europe’s Leading Online Platform for Fashion Goes Beyond the Data Lake
Max Schultze - max.schultze@zalando.de - @mcs1408
Arif Wider - awider@thoughtworks.com - @arifwider