In Zalando's microservice architecture, each service continuously generates streams of events for inter-service communication and data integration. Some of these events describe business processes, e.g. a customer has placed an order or a parcel has been shipped. From this, the need evolved to materialize event streams from the central event bus into persistent cloud storage. The temporarily persisted data is then integrated into our relational data warehouse. In this talk we present a materialization engine backed by Apache Flink. We show how we employ Flink's RESTful API, custom accumulators and stoppable sources to provide another API abstraction layer for deploying, monitoring and controlling our materialization jobs. Our jobs compact event streams depending on event properties and transform their complex JSON structures into flat files for easier integration into the data warehouse.
Europe's leading online fashion platform
15 countries
~21 million active customers
~3.6 billion € revenue 2016
250,000+ products
2,000 brands
13,000+ employees in Europe
WE ARE CONSTANTLY INNOVATING TECHNOLOGY
Home-brewed, cutting-edge & scalable technology solutions
~1,800 employees from 77 nations, across our tech locations and HQs in Berlin, help our brand to WIN ONLINE
tech.zalando.com
MICROSERVICES ARCHITECTURE
[Diagram: Apps A, B, C and D each encapsulate their own business logic and database behind a REST API; the services communicate with each other and with the Nakadi Event Bus via REST APIs.]
Everything runs on Amazon Web Services
NAKADI - CENTRAL EVENT BUS
A distributed event bus that implements a RESTful API
abstraction over Kafka-like queues.
● 1000s of Kafka topics
● high variance in event structure, size & throughput
● consumers can specify partition offsets and batch size in the GET request (see the sketch below)
https://github.com/zalando/nakadi
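As an illustration of that low-level consumption style, here is a minimal sketch in plain Java of reading one slice of an event stream from Nakadi. The host, event type name, cursor and limits are made-up example values; the /event-types/{name}/events endpoint and the X-Nakadi-Cursors header are taken from Nakadi's public documentation.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NakadiConsumerSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Nakadi host and event type; batch_limit caps events per batch,
        // stream_limit closes the stream after that many events.
        URI uri = URI.create("https://nakadi.example.org/event-types/order.ORDER_CREATED/events"
            + "?batch_limit=100&stream_limit=100");

        // The X-Nakadi-Cursors header selects the partition and the offset to start reading from.
        HttpRequest request = HttpRequest.newBuilder(uri)
            .header("X-Nakadi-Cursors", "[{\"partition\":\"0\",\"offset\":\"BEGIN\"}]")
            .GET()
            .build();

        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());

        // The body contains newline-separated JSON batches, each with a cursor and an array of events.
        System.out.println(response.body());
    }
}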
DATA INTEGRATION CHALLENGE
[Diagram: How do we get the events that Apps A-D publish to the Nakadi Event Bus via REST APIs into our relational data warehouses (Oracle, Amazon Redshift, other RDBMS)?]
REQUIREMENTS
Goal: Consume data streams in a relational-database-friendly way
● Materialize data from Nakadi event bus into cloud storage
● Transform complex JSON events into easily ingestible CSV flat files, incl. the flattening of arrays (see the sketch after this list)
● Relieve load on the (monolithic) data warehouse by
compacting event streams according to event properties
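As a rough illustration of the array flattening, the following is a minimal sketch (not the production code) that turns one JSON event with a nested array into one CSV line per array element, using Jackson; the field names and event shape are made up for the example.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonFlatteningSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical order event with a nested "items" array.
        String event = "{\"order_number\":\"O-1\",\"customer\":\"C-9\","
            + "\"items\":[{\"sku\":\"A\",\"qty\":2},{\"sku\":\"B\",\"qty\":1}]}";

        JsonNode root = new ObjectMapper().readTree(event);

        // Emit one flat CSV row per array element, repeating the top-level fields.
        for (JsonNode item : root.get("items")) {
            System.out.println(String.join(",",
                root.get("order_number").asText(),
                root.get("customer").asText(),
                item.get("sku").asText(),
                item.get("qty").asText()));
        }
        // Output:
        // O-1,C-9,A,2
        // O-1,C-9,B,1
    }
}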
DATA INTEGRATION CHALLENGE
[Diagram: The Materialization Engine consumes events from the Nakadi Event Bus via its REST API, exposes its own REST API, and writes to AWS S3, from where the relational data warehouses (Oracle, Amazon Redshift, other RDBMS) ingest the data.]
STREAM COMPACTION: MOTIVATION
● Free up DWH resources by reducing the size of the data to ingest
● Can be applied to events which represent changes to
the same resource, i.e. having the same partitioning
key, but different ordering keys:
{
  "article_id": 123,
  "brand": "Nike",
  "model": "Air Max",
  "version": 1,
  "available_qty": 50
}
{
  "article_id": 123,
  "brand": "Nike",
  "model": "Air Max",
  "version": 2,
  "available_qty": 30
}
STREAM COMPACTION: IMPLEMENTATION
Deploy Flink jobs from the Request Handler, with the job configuration passed as a parameter (as sketched below).
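For illustration, submitting a pre-uploaded job jar with its configuration as a program argument could look roughly like this against Flink's REST API; the host, jar id and argument string are placeholders, while the /jars/{jarid}/run endpoint and the programArgs field come from Flink's documented REST API.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DeployJobSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder JobManager host and previously uploaded jar id.
        String url = "http://flink-jobmanager.example.org:8081"
            + "/jars/a1b2c3d4-materializer.jar/run";

        // Pass the materialization config (event type, compaction keys, ...) as program arguments.
        String body = "{\"programArgs\": \"--config s3://bucket/config/order-events.json\"}";

        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());

        // The response contains the job id, which can later be polled via GET /jobs/{jobid}.
        System.out.println(response.body());
    }
}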
Compact the stream according to the partitioning and
ordering keys.
Compaction rates up to 70%
[Diagram: pipeline of KeyBy → SessionWindow in processing time → Reducer]
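A minimal sketch of this kind of compaction with Flink's DataStream API, assuming a simple ArticleEvent POJO with article_id as the partitioning key and version as the ordering key; the session gap, the inline test data and the event type are illustrative, since the real jobs are parameterized by the Request Handler.

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class CompactionJobSketch {

    // Hypothetical event type mirroring the article example above.
    public static class ArticleEvent {
        public long articleId;   // partitioning key
        public long version;     // ordering key
        public int availableQty;
        public ArticleEvent() {}
        public ArticleEvent(long articleId, long version, int availableQty) {
            this.articleId = articleId;
            this.version = version;
            this.availableQty = availableQty;
        }
        @Override
        public String toString() {
            return articleId + "," + version + "," + availableQty;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(
                new ArticleEvent(123, 1, 50),
                new ArticleEvent(123, 2, 30))
            // KeyBy the partitioning key ...
            .keyBy(new KeySelector<ArticleEvent, Long>() {
                @Override
                public Long getKey(ArticleEvent e) {
                    return e.articleId;
                }
            })
            // ... collect changes to the same resource in a processing-time session window ...
            .window(ProcessingTimeSessionWindows.withGap(Time.minutes(1)))
            // ... and reduce each window to the event with the highest ordering key.
            .reduce((a, b) -> a.version >= b.version ? a : b)
            .print();

        env.execute("compaction-sketch");
    }
}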
STREAM COMPACTION: OUT-OF-ORDER EVENTS
Out-of-order events = events that were not received in the expected order, e.g. a higher version value arrives first
● Materialize out-of-order events in a different CSV file (correction file)
○ Generated as a side output from the main stream (see the sketch below)
● The relational DWH creates a view over the main and correction files before merging into the Operational Data Store (ODS) table
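A minimal sketch of such a side output with Flink's KeyedProcessFunction, reusing the hypothetical ArticleEvent type from the compaction sketch above; the tag name, the state name and the ordering check are illustrative assumptions, not the production job.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class CorrectionSideOutputSketch {

    // Hypothetical tag for the correction stream; the anonymous subclass keeps the type information.
    static final OutputTag<CompactionJobSketch.ArticleEvent> CORRECTIONS =
        new OutputTag<CompactionJobSketch.ArticleEvent>("corrections") {};

    static class OutOfOrderSplitter extends KeyedProcessFunction<
            Long, CompactionJobSketch.ArticleEvent, CompactionJobSketch.ArticleEvent> {

        private transient ValueState<Long> lastVersion;

        @Override
        public void open(Configuration parameters) {
            // Track the highest version seen so far for each key.
            lastVersion = getRuntimeContext().getState(
                new ValueStateDescriptor<>("last-version", Long.class));
        }

        @Override
        public void processElement(CompactionJobSketch.ArticleEvent event, Context ctx,
                                   Collector<CompactionJobSketch.ArticleEvent> out) throws Exception {
            Long last = lastVersion.value();
            if (last != null && event.version < last) {
                // An older version arrived after a newer one: route it to the correction file.
                ctx.output(CORRECTIONS, event);
            } else {
                lastVersion.update(event.version);
                out.collect(event);   // main CSV file
            }
        }
    }
}

The correction stream would then be obtained from the processed stream with getSideOutput(CORRECTIONS) and written to the separate correction CSV file.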
DWH CLIENT’S IMPORT PROCESS
● One main CSV file and one correction CSV file are produced per partition
● External table MAIN and external table CORRECTION expose these files to the DWH
● A view combines all records from CORRECTION with all records from MAIN that are not in CORRECTION
● A post-processing hook processes the newly extracted data, e.g. merges it into the ODS table in the DWH's ODS layer
LEGACY DATA INTEGRATION ARCHITECTURE
[Diagram: Apps A-D publish to the Nakadi Event Bus via REST APIs; an Exporter with its own REST API, stream processing via Apache Flink, AWS S3, an Importer and a Configuration Manager sit between the event bus and the relational data warehouses (Oracle, Amazon Redshift, other RDBMS).]
ADVANTAGES OVER LEGACY APPROACH
Fewer stacks: Flink + Materialization API + ZooKeeper
instead of Importer + Kafka + Flink + Exporter API + ZooKeeper
● Reduced AWS costs
● Decreased operational overhead
○ No data redundancy (Nakadi + Kafka), no Importer setup
○ Far less maintenance, e.g. no streaming array flattening jobs
● Easier reasoning and implementation through Flink’s API
○ Compaction
○ Extensible feature set (more ETL in the future)