A talk given by Dr. Arif Wider (ThoughtWorks) and Sebastian Herold (Zalando) at OOP 2018 in Munich.
Abstract:
More and more companies migrate their monolithic applications to a microservices architecture. However, maintaining a consistent and usable data landscape has only become more challenging as a result: huge amounts of structured and unstructured data, and hundreds of data sources.
Furthermore, data-driven product development multiplies the analytics requirements: every product team needs constantly updated and specially tailored metrics which often combine product specific data with company wide data.
Having a centralized data team does not scale in this setting as it becomes the bottleneck between data producers and data consumers.
We created a manifesto of seven principles that breaks with the traditional separation of roles and shows a path for dealing with distributed data in a federated and scalable fashion. This leads to DataDev: a culture shift similar to DevOps in which application developers own their data and take over responsibilities for data & analytics.
Learn about our experiences and best practices with facilitating this cultural transformation at Scout24, the provider of Europe’s largest online markets for cars and real estate.
1. DataDevOps: A Manifesto for a
DevOps-like Culture Shift
in Data & Analytics
Dr. Arif Wider & Sebastian Herold
Munich, Feb 7th, 2018
2. Dr. Arif Wider
- Senior Consultant/Dev
- Scala/FP enthusiast
- ThoughtWorks Germany data strategy group
@arifwider
Sebastian Herold
- Chief Data Architect @Scout24 (until Dec)
- Big Data Architect @Zalando (from Jan)
- Data Evangelist
@heroldamus
3. Road to Microservice Architecture – How we started in 2007
[Diagram, 2007: Web Tier -> Middle Tier -> Core DB; Core DB and CRM -> Staging -> DWH -> BI Tool; the Analyst uses the BI Tool, the BI Dev maintains the DWH]
DataDevOps – Data Manifesto | Sebastian Herold & Arif Wider
4. Road to Microservice Architecture – How things got complicated in 2011
[Diagram, 2011: the monolith (Web, Middle Tier, Core DB) is joined by new apps with their own MySQL databases and APIs; CRM and all databases feed Staging -> DWH -> BI Tool; Oracle licensing costs ($$$) grow; Analyst and BI Dev as before]
5. Road to Microservice Architecture – How we sliced the monolith in 2013
[Diagram, 2013: several apps, each with its own MySQL database and API; denormalized stores for EXP (Mongo) and SEA (Elastic), kept up to date by a Sync APP; a HADOOP cluster fronted by a REST API; Core DB and CRM -> Staging -> DWH -> BI Tool; Web; Analyst, BI Dev, and a Data Engineer (DE)]
6. Road to Microservice Architecture – How a central data team doesn't scale
[Diagram, 2015: even more apps, some already on AWS; many MySQL databases, EXP (Mongo), SEA (Elastic), Sync APP, HADOOP with REST API; Core DB and CRM -> Staging -> DWH -> BI Tool; Analyst, BI Dev, and DE all depend on the central data team]
7. Road to Microservice Architecture – How we rearchitected our Data Landscape
[Diagram, 2017: apps running on AWS publish to a Central Data Lake on S3; Core DB, CRM, and the Hadoop REST API feed the lake; DWH and BI Tool sit on top of it; Analyst, DE, and BI Dev all work against the lake]
8. Scout24 wants to become a truly data-driven company
Fast & easy data-driven product development…
…supported by Data & Analytics
9. Scout24 wants to become a truly data-driven company
Everywhere in the company... ...without bloating up D'n'A
Image source: https://www.oddsemiconductorservices.com/
11. SCOUT24 DATA LANDSCAPE MANIFESTO
#1 Preamble
Data is a key asset of our company.
12. SCOUT24 DATA LANDSCAPE MANIFESTO
#2 Our Responsibility
We, Data & Analytics, are responsible for providing a solid Data Platform as well as clear guidelines and training on how to participate in the Data Landscape.
[Diagram: D'n'A provides the Data Platform that underpins the Data Landscape]
13. SCOUT24 DATA LANDSCAPE MANIFESTO
#3 Data Autonomy, Not Anarchy
Data autonomy puts data producers & data consumers in control of their data & of their metrics and thereby allows us to be data-driven at scale, but this comes with responsibility.
[Diagram: data Producer and Consumer on the Data Platform; D'n'A; Data Landscape]
14. Roles & Responsibilities
[Diagram: a Checkout service (Producer) and a Special offer service (Consumer) around the Central Data Lake on S3 with a Data Catalog, operated by D'n'A]
15. SCOUT24 DATA LANDSCAPE MANIFESTO
#4 Producer's Responsibility
Data producers are responsible for publishing data to the central Data Lake, for the data's quality, and for publishing metadata that makes it easy to find and consume the data.
[Diagram: the Producer publishes Data plus Metadata onto the Data Platform; D'n'A; Data Landscape]
16. Roles & Responsibilities
[Diagram: the Checkout service (Producer) publishes order events to the Central Data Lake on S3; the Special offer service is the Consumer; Data Catalog; D'n'A]
17. Roles & Responsibilities
[Diagram: as before, but the Checkout service publishes its order events via an Ingestion Template provided by D'n'A; Data Catalog]
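The "Ingestion Template" in the diagram is, in essence, a reusable recipe that a producer team fills in to get its data into the lake in a standard shape. A minimal sketch of what such a template might capture; all names, fields, and paths here are illustrative assumptions, not Scout24's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class IngestionTemplate:
    """Hypothetical sketch of the information a producer team fills in
    to publish a dataset to the central data lake on S3."""
    dataset: str                 # e.g. "checkout.order-events"
    owner_team: str              # who to contact about this data
    s3_prefix: str               # target location in the lake
    schema: dict                 # field name -> type description
    partition_keys: list = field(default_factory=list)

    def target_path(self, **partitions) -> str:
        """Build the S3 key prefix for one concrete partition."""
        parts = [f"{k}={partitions[k]}" for k in self.partition_keys]
        return "/".join([self.s3_prefix.rstrip("/"), *parts])

template = IngestionTemplate(
    dataset="checkout.order-events",
    owner_team="checkout",
    s3_prefix="s3://data-lake/checkout/order-events",
    schema={"order_id": "string", "user_id": "string", "amount": "decimal"},
    partition_keys=["year", "month", "day"],
)
print(template.target_path(year=2018, month=2, day=7))
# s3://data-lake/checkout/order-events/year=2018/month=2/day=7
```

The point of the template is that every producer answers the same questions (where, what schema, who owns it), so consumers can rely on a uniform layout.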
18. SCOUT24 DATA LANDSCAPE MANIFESTO
#5 Consumer's Responsibility
Data consumers are responsible for the definition & visualization of metrics and for driving the implementation and maintenance of these metrics.
[Diagram: Producer and Consumer on the Data Platform; D'n'A; Data Landscape]
19. Roles & Responsibilities
[Diagram: the Special offer service (Consumer) defines a view on the lake data, "order history by user", on top of the order events ingested by the Checkout service via the Ingestion Template; Data Catalog; D'n'A]
20. SCOUT24 DATA LANDSCAPE MANIFESTO
#6 Exception: Core KPIs
We, Data & Analytics, take full ownership of and responsibility for the few top company-wide core KPIs.
[Diagram: D'n'A owns the Core metric; Producer and Consumer on the Data Platform; Data Landscape]
21. Roles & Responsibilities
[Diagram: on top of the order events and the "order history by user" view, a core view, "revenue generated from orders by segments", is consumed by the Analyst via the BI Tool; Central Data Lake on S3; Data Catalog; Ingestion Template; D'n'A]
22. SCOUT24 DATA LANDSCAPE MANIFESTO
#7 Transparency Over Continuity
We value data transparency over data continuity, which means we may break metric comparability if it is for the cause of enabling better insights.
[Diagram: Producer and Consumer on the Data Platform; Core metric; D'n'A; Data Landscape]
23. SCOUT24 DATA LANDSCAPE MANIFESTO
The Ultimate Goal
A federated landscape of data producers and consumers with just enough rules to ensure seamless cooperation without severely impeding autonomy.
[Diagram: Producers publish Data and Metadata; Consumers build Data products; the Core metric stays with D'n'A; all on the Data Platform within the Data Landscape]
24. Consequences for Product Development Teams?
- Think about data & reporting
- Deliver your data to the lake
- Provide metadata (schema, descriptions, versions)
- Eat your own dog food: consume your own data for reporting -> take responsibility for data quality
25. Benefits for Product Development Teams?
- Work with data independently
- No dependencies on data teams
- Company data is curated, and it's easy to consume data produced by other teams
27. Learnings and lessons
Publish exhaustive, general, and denormalized event data
Avoid consumer-specific tailoring of the data you publish
Consume your own data, e.g. for KPI reports
Try out ad-hoc analytics notebooks to get better insights
Inform data producers if you rely on their data
Invest in documentation and guidelines for your data platform to keep your support effort low
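To illustrate the first learning: a denormalized event carries everything a consumer might plausibly need inline, instead of foreign keys that force a join back into the producer's internal tables. The event shape and field names below are illustrative, not an actual Scout24 schema:

```python
import json

# Normalized: consumers must join order_id/user_id/product_id against
# the producer's internal tables to learn anything useful.
normalized_event = {"order_id": "o-42", "user_id": "u-7", "product_id": "p-3"}

# Denormalized: exhaustive and general, usable without extra lookups.
denormalized_event = {
    "order_id": "o-42",
    "user": {"id": "u-7", "segment": "private"},
    "product": {"id": "p-3", "category": "real-estate-listing"},
    "amount": {"value": "29.90", "currency": "EUR"},
    "occurred_at": "2018-02-07T10:15:00Z",
}
print(json.dumps(denormalized_event, indent=2))
```

Publishing the richer event costs a little storage but spares every downstream consumer a dependency on the producer's internal data model.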
Perspective of a data engineer
-> in reality it was much more complex -> simplified here to make things clear
Let's go back 10 years to 2007 (some parts are even older than that)
Application: clean 3-tier architecture
Web Tier
Middle Tier
Operative Oracle DB
(Click) Analysts wanted to create reports
(Click) own DWH so as not to block the Core DB with analytical queries
Core DB -> Staging -> DWH -> BI Tool
2011:
more and more systems needed to be integrated into the DWH
the one-size-fits-all database approach didn't scale anymore; increasingly different load profiles
paying a big amount of money to Oracle
2013 (4 years ago):
Beginning of the chaos
DB scaling problem solved -> denormalization of data: own DBs for search and detail pages -> synchronization of the data
More microservices showed up that provide data
More unstructured data that does not fit into classical relational data stores
Built a Hadoop cluster
Not suited for inserting single events
REST API in front: collects events of the same type, packages them into bigger chunks, and copies them to HDFS
Easy reporting for applications
JSON for business reporting is the new standard, completely different from the previous relational world
Standardisation through company-wide unique IDs
Direct connection to BI tools
More and more analysts and data scientists work directly on the cluster using Hive, Spark, etc.
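The ingestion API described above (collect events of the same type, package them into bigger chunks, then copy them to HDFS) can be sketched as a simple buffering collector. This is a toy model of the idea, not the actual service; the flush callback and the size threshold are assumptions (real systems typically also flush on a timer):

```python
class EventCollector:
    """Sketch of the batching behaviour of a REST ingestion API:
    buffer events per type and flush them as one chunk once a
    threshold is reached."""

    def __init__(self, flush_fn, chunk_size=3):
        self.flush_fn = flush_fn      # e.g. writes one file to HDFS/S3
        self.chunk_size = chunk_size  # flush threshold per event type
        self.buffers = {}             # event type -> list of events

    def collect(self, event_type, event):
        buf = self.buffers.setdefault(event_type, [])
        buf.append(event)
        if len(buf) >= self.chunk_size:
            self.flush_fn(event_type, list(buf))  # one big chunk out
            buf.clear()

chunks = []
c = EventCollector(lambda t, evts: chunks.append((t, evts)), chunk_size=2)
c.collect("order", {"id": 1})
c.collect("order", {"id": 2})   # threshold reached -> one chunk of two events
print(chunks)  # [('order', [{'id': 1}, {'id': 2}])]
```

Batching like this is what makes HDFS-style storage workable: it avoids the many tiny files that single-event inserts would produce.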
2015:
We had complete chaos
More and more applications
Cloud strategy -> in the long run we should put everything on AWS
Most of the time we were maintaining mappings
My team needed to collect metadata all the time and deeply understand the different domains
Central bottleneck for the whole company
No one could introduce new data or change data without us
People got mad at us
We needed to change something
2017:
(Click) Merge BI developers and data engineers into one team
(Click) Establish a central data lake within AWS
Leading system for structured and unstructured data; easy to connect/join things
Why S3?
Cheap & reliable, at least cheaper and more reliable than most of the people in the room could provide
Integrated into most current big data technologies
Accessible by many clusters at once
The performance disadvantage is not that big; intermediate results are sometimes kept in HDFS
(Click) DWH is just a cache for analytical queries
(Click) Old applications in our on-premise data center still use the Hadoop REST API
(Click) Direct exports from databases
(Click) CRM imports and exports data
(Click) New applications stream data through Kinesis Firehose
The requirements were that dev teams can easily ingest data into the data platform and that the data can be joined
Of course, this is a bird's-eye view; in reality it's much more complex
And then another topic came along, but Arif will tell you about it
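Streaming through Kinesis Firehose boils down to serializing each event and handing it to a delivery stream; Firehose then batches the records into S3 objects. A sketch of building such a record payload; the stream name is hypothetical, and the commented-out call shows roughly how it would be sent with boto3's Firehose client:

```python
import json

def to_firehose_record(event: dict) -> dict:
    """Serialize an event as a newline-delimited JSON Firehose record.
    The trailing newline keeps the JSON objects separable after
    Firehose concatenates many records into one S3 file."""
    return {"Data": (json.dumps(event) + "\n").encode("utf-8")}

record = to_firehose_record({"order_id": "o-42", "amount": 29.9})

# With AWS credentials configured, this would be sent roughly as:
#   boto3.client("firehose").put_record(
#       DeliveryStreamName="order-events",  # hypothetical stream name
#       Record=record)
print(record["Data"])
```

The newline detail matters in practice: without a delimiter, downstream readers (Hive, Spark, Athena and the like) cannot split the concatenated S3 objects back into individual events.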
- Because of microservices, the amount and heterogeneity of data sources has multiplied.
- Sebastian explained nicely how this can be tackled with a more appropriate technical approach.
- However, in parallel to this technical development, there was also a strong push for data-driven product development happening at Scout.
- What does this mean? A culture of experimentation (small cycles)
- …this means that now the number of data consumers in the company has also multiplied.
- These consumers want to correlate their specific data with the general data warehouse data.
- D'n'A wants to help, but the company does not want to spend the resources to multiply the data team equally.
- As a result, the data team became even more of a bottleneck and the frustration on both sides went up
- Often because of unclear responsibilities, or a distribution of responsibilities that had not changed since 2007
- Therefore we realised that it is not enough to put the technical organisation on a new solid foundation; the way people interact with data and manage responsibilities for data also needed a new foundation.
- To signal a new way of thinking here, we had the idea to formulate a Data Landscape Manifesto which we as a company would agree on.
- This is about roles, responsibilities, and common values
- It consists of 7 principles, each based on an assumption or belief from which we derived that principle.
We believe that collecting & analyzing data is crucial to understanding our business, our customers, and the market, in order to provide the right services & products.
Although this is nothing surprising these days, we wanted to start with this to ensure a common understanding of why all of this is important in the first place.
--> Loosely coupled (microservices), strongly ALIGNED (Jez Humble, Adrian Cockcroft)
We therefore believe that everyone in the company must have easy access to the data available, and it must be easy to publish data which can be used by others. This requires a solid Data Platform: easy-to-use tools, reliable infrastructure, and simple guidelines for publishing & consuming data.
…
This is our core responsibility (and we wanted to start with this side).
The data landscape is the playground on which data producers and data consumers interact. We provide the platform and the clear guidelines, but we do not own that space.
The reason for this is that we believe...
We believe that exhaustive centralized data management does not allow us to scale to the level of data creation and consumption we aspire to as a company, because it creates a bottleneck and introduces accidental, indirect dependencies. Instead, we believe that data autonomy is the only way for data usage to scale across the company. However, for data autonomy not to become data anarchy, there has to be a clear set of basic rules and responsibilities.
Data autonomy puts…
Introduce roles
We believe that extensive data availability, data discoverability, and data usability are crucial and that – at scale – no one else can ensure this other than the one controlling the source where the data is originally generated.
We believe that the stakeholder of a metric has to be the single owner of that metric and its definition, and has to drive its implementation.
Without a single source of truth about what a metric means, we risk that multiple diverging and possibly contradicting understandings and implementations develop over time.
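The "single source of truth" idea can be made concrete as a small metric registry in which each metric name has exactly one owner and one definition, so diverging implementations cannot creep in silently. This is an illustrative sketch, not anything Scout24 describes building; all names are made up:

```python
class MetricRegistry:
    """Sketch: exactly one owner and one definition per metric name."""

    def __init__(self):
        self._metrics = {}

    def register(self, name, owner, definition):
        # Refuse a second, possibly diverging, definition of the same metric.
        if name in self._metrics:
            raise ValueError(f"metric {name!r} already owned by "
                             f"{self._metrics[name]['owner']}")
        self._metrics[name] = {"owner": owner, "definition": definition}

    def lookup(self, name):
        return self._metrics[name]

registry = MetricRegistry()
registry.register("orders.revenue", owner="checkout",
                  definition="sum of order amounts, excluding fraud")
print(registry.lookup("orders.revenue")["owner"])  # checkout
```

In practice the "registry" might just be the data catalog, but the invariant is the same: a metric stakeholder owns the definition, and everyone else looks it up rather than re-deriving it.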
We believe that a minimum level of company-wide comparability & reliability of core KPIs is crucial for leading the company in the right direction.
The management is the owner of these core KPIs and the data group represents the management here in terms of metric ownership.
We believe that transparency is crucial for understanding what the meaning of a metric is.
If month-to-month comparability must never break, there is no way to continuously improve metrics and their transparency based on new insights.
To stay with the example: if we learn that a certain number of orders are actually fraud, then we want to report the real revenue.
What does it mean for product development teams in their day-to-day business?
(Click) Think about data:
Reporting, how to structure data?
Which database should I use? At least in AWS there are tons of options
Maybe you need to maintain it yourself
(Click) They need to bring the data themselves (supported by the data platform team / documentation)
(Click) They need to provide metadata:
Schema
Description
Connectivity (IDs matching other IDs in the lake)
Versioning
(Click) Eat your own dog food: use your delivered data for your own reporting
Twist in responsibility: data quality is managed by the producer
-> understand the reporting infrastructure
-> take the view of a data consumer and understand what other people do with the data
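The metadata checklist above (schema, description, connectivity, versioning) lends itself to a simple completeness check on a producer's catalog entry before data is accepted into the lake. A sketch; the entry format and field names are assumptions for illustration:

```python
# Required catalog fields, mirroring the checklist above.
REQUIRED_METADATA = ("schema", "description", "connectivity", "version")

def validate_metadata(entry: dict) -> list:
    """Return the required metadata fields the producer forgot."""
    return [f for f in REQUIRED_METADATA if f not in entry]

entry = {
    "schema": {"order_id": "string", "user_id": "string"},
    "description": "One event per completed checkout order.",
    "connectivity": {"user_id": "matches users.user_id in the lake"},
    "version": 2,
}
print(validate_metadata(entry))           # complete entry -> nothing missing
print(validate_metadata({"schema": {}}))  # incomplete entry -> missing fields
```

Gating ingestion on such a check is one way the platform team can enforce the producer's responsibility without reviewing every dataset by hand.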
What is the benefit?
No waiting for the data team -> work independently
Their own data and data from other teams is easier to use and can be integrated easily, because everybody is using the same paradigm
So we just heard about more responsibility and required skills on the one side, but in return fewer dependencies and shorter cycle times on the other side.
This sounds a lot like what DevOps is preaching.
…