- The document discusses the convergence of asynchronous micro-services and unified log architectures: asynchronous micro-services are replacing traditional synchronous micro-services, and analytical workflows are moving to unified log architectures.
- A unified log, like Apache Kafka, provides a single version of truth and unravels the "hairball" of separate data integrations. It allows for wider data coverage, full data history, and low latency queries.
- Asynchronous micro-services are loosely coupled through event streams, which provides advantages over synchronous request/response services like easier upgrades and reduced failures. There will be more tools for building asynchronous services and open source schema registries will be important.
2. Introducing myself
• Alexander Dean
• Co-founder and technical lead at Snowplow, the open-source event data pipeline
• Weekend writer of Unified Log Processing, available on the Manning Early Access Program
• Co-author at Snowplow of Iglu, our open-source schema registry system, and Sauna, our open-source decisioning and response platform
3. We are witnessing the convergence of two separate technology tracks towards asynchronous or event-driven micro-services
[Diagram: two tracks converge. Transactional workloads move from software monoliths, through synchronous micro-services, to asynchronous micro-services; analytical workloads move from classic data warehousing, through hybrid data pipelines, to unified log architectures]
5. A quick history lesson: the three eras of business data processing [1]
1. Classic data warehousing, 1996+
2. Hybrid data pipelines, 2005+
3. Unified log architectures, 2013+
[1] http://snowplowanalytics.com/blog/2014/01/20/the-three-eras-of-business-data-processing/
6. The era of classic data warehousing, 1996+
[Diagram: in our own data center, narrow data silos (CMS, CRM, E-comm, ERP) each run their own low-latency local loops; a nightly batch ETL process feeds point-to-point connections into a data warehouse, which gives wide data coverage and full data history for management reporting, but at high latency]
7. The era of hybrid data pipelines, 2005+
[Diagram: narrow data silos spread across the cloud vendor / own data center (Search, E-comm, ERP, CMS) plus CRM, email marketing, and web analytics at three SaaS vendors, each with its own low-latency local loop; APIs and bulk exports feed stream processing (product rec's), micro-batch processing (systems monitoring), and batch processing into Hadoop and a data warehouse for management reporting and ad hoc analytics. Local loops stay low latency; the warehouse and Hadoop paths are high latency]
8. The hybrid era: a surfeit of software vendors
[Same diagram as slide 7]
9. The hybrid era: company-wide reporting and analytics ends up like Rashomon
The bandit's story vs. the wife's story vs. the samurai's story vs. the woodcutter's story
10. The hybrid era: the number of data integrations
is unsustainable
12. The advent of the unified log, 2013+
[Diagram: the silos (Search, E-comm, ERP, CMS in the cloud vendor / own data center; CRM and email marketing at SaaS vendors) keep some low-latency local loops, but now feed a unified log via streaming APIs / web hooks and APIs. The log holds a few days' data history and serves low-latency consumers (systems monitoring, product rec's, fraud detection, churn prevention) over the event stream, while archiving to Hadoop for high-latency work (ad hoc analytics, management reporting). Together they give wide data coverage and full data history]
13. The unified log is Amazon Kinesis, or Apache Kafka
[Same diagram as slide 12]
• Amazon Kinesis, a hosted AWS service, with extremely similar semantics to Kafka
• Apache Kafka, an append-only, distributed, ordered commit log, developed at LinkedIn to serve as their organization's unified log
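The "append-only, ordered commit log" semantics can be sketched with a toy in-memory model: producers only ever append, records are never removed by reads, and each consumer tracks its own offset. This is an illustration of the semantics, not Kafka's or Kinesis's actual API.

```python
from collections import defaultdict

class UnifiedLog:
    """Toy model of an append-only, ordered commit log (Kafka-style semantics)."""

    def __init__(self):
        self._events = []                 # append-only: records are never mutated
        self._offsets = defaultdict(int)  # each consumer tracks its own position

    def append(self, event):
        """Producers only ever append; returns the new record's offset."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, consumer_id, max_records=10):
        """Read from this consumer's offset onwards. Reading does not remove
        records, so any number of consumers can replay the stream independently."""
        start = self._offsets[consumer_id]
        batch = self._events[start:start + max_records]
        self._offsets[consumer_id] = start + len(batch)
        return batch

log = UnifiedLog()
log.append({"type": "page_view", "user": "alice"})
log.append({"type": "add_to_basket", "user": "alice"})

# Two independent consumers read the same ordered stream of events
print(log.poll("fraud-detection"))    # both events
print(log.poll("management-report"))  # both events again, from its own offset
```

Because consumption is just reading at an offset, adding another downstream application never disturbs producers or other consumers, which is the property the rest of the talk builds on.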
14. “Kafka is designed to allow a
single cluster to serve as the
central data backbone for a
large organization” [1]
[1] http://kafka.apache.org/
15. So what does a unified log give us?
1. A single version of the truth
2. Our truth is now upstream from the data warehouse
3. The hairball of point-to-point connections has been unravelled
4. Local loops have been unbundled
16. Coming up to the end of 2016, and unified log architectures have
seen extremely rapid and widespread adoption
18. In parallel, we have seen a steady (if spotty) rejection of
software monoliths for transactional workloads
19. In a micro-services architecture, the individual capabilities of the system are split out into separate services
Synchronous communication using request and response (often using RESTful HTTP or RPC)
20. What do synchronous micro-services give us?
1. Strong module boundaries
• Network boundaries between modules can be helpful for larger teams
2. Independent deployment
• Deploy individual micro-services independently
• Simpler to deploy and less likely to cause whole system failures
3. Support diversity
• Can use the best language/framework/database for the capability
• Reduces monoculture risk (anti-fragile)
22. When we re-architected Snowplow around the unified log 2.5
years ago, we designed it around small, composable workers
Diagram from our February 2014
Snowplow v0.9.0 release post
23. This was based on the insight that real-time pipelines can be
composed a little like Unix pipes
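The Unix-pipe analogy can be sketched with Python generators: each worker consumes an event stream and yields a new one, so stages compose exactly like `collect | enrich | sink`. The stage names and the geo lookup here are hypothetical, chosen only to echo the Snowplow collector/enrich pipeline shape.

```python
def collect(raw_lines):
    """Worker 1: parse raw payloads into events, skipping malformed ones."""
    for line in raw_lines:
        parts = line.strip().split(",")
        if len(parts) == 2:
            yield {"user": parts[0], "action": parts[1]}

def enrich(events):
    """Worker 2: add derived fields (a hypothetical geo lookup for illustration)."""
    geo = {"alice": "UK", "bob": "US"}
    for event in events:
        event["country"] = geo.get(event["user"], "unknown")
        yield event

def sink(events):
    """Worker 3: deliver enriched events downstream (here, just collect them)."""
    return list(events)

raw = ["alice,page_view", "malformed", "bob,add_to_basket"]
# Compose like a shell pipeline: cat raw | collect | enrich | sink
out = sink(enrich(collect(raw)))
print(out)
```

In the real pipeline each `|` is a durable stream (a Kinesis stream or Kafka topic) rather than an in-process generator, which is what makes the intermediate streams first-class and independently consumable.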
24. We avoided monolithic Spark Streaming or Storm jobs, based on our experiences with "heavy" Hadoop jobs in our batch pipeline
[Diagram: a chain of small workers connected by Kinesis streams, versus what we didn't do: an inner Storm topology]
25. We wanted to avoid the "inner topology" effect, with effectively two tiers of topology to reason about
1. Difficult to unit test the inner topologies – complex behaviours inside each unit
2. Difficult to operationalize the inner topologies – how do they handle backpressure, how do they scale, how do we upgrade them?
3. Difficult to monitor the inner topologies
Fundamental problem: the event streams in an inner topology are not first class entities
26. It worked: today the Snowplow real-time pipeline is a collection of individual event-driven micro-services
[Diagram: Stream Collector feeds Stream Enrich, which fans out to Kinesis S3, Kinesis Elasticsearch, Kinesis Tee, and Kinesis Redshift (design stage), plus the user's own AWS Lambda function, KCL worker app, or Spark Streaming job]
27. Meanwhile, the Kafka team (now at Confluent) were seeing something interesting in the adoption of Kafka…
"the most compelling applications for stream processing are actually pretty different from what you would typically do with a Hive or Spark job—they are closer to being a kind of asynchronous microservice rather than being a faster version of a batch analytics job. …"
28. And these micro-services were substituting not just for batch analytical workloads, but also for transactional workloads
"…What I mean is that these stream processing apps were most often software that implemented core functions in the business rather than computing analytics about the business." [1]
[1] http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/
29. Why are async micro-services replacing "classic" request/response micro-services (at least amongst Kafka users)?
30. To start with, asynchronous micro-services have the same benefits as synchronous micro-services…
1. Strong module boundaries
• Network boundaries between modules can be helpful for larger teams
2. Independent deployment
• Deploy individual micro-services independently
• Easier to deploy and less likely to cause whole system failures
3. Support diversity
• Can use the best language/framework/database for the capability
• Reduces monoculture risk (anti-fragile)
31. … but in addition, event-driven asynchronous micro-services are extremely loosely coupled, because they are intermediated by first-class streams
32. Compare this to request/response synchronous micro-services, which have dependencies between upstream and downstream
[Diagram: a dependency graph of upstream and downstream services: single-page web app, login service, notifications service, content service, customer profile service, personalization service, ad service]
33. If you can afford the (small) latency tax, there are some clear advantages in going asynchronous
1. Much better toolkit for upgrades – schema evolution, running old and new service versions in parallel, etc.
2. Adding new downstream services doesn't increase the load on upstream services
3. Failure of individual services introduces lag into the overall system, rather than overall system failure
4. Easier to debug, because service inputs and outputs are directly inspectable in the event streams
35. More and more tooling for building asynchronous micro-services, including the unstoppable rise of "serverless"
• Stream processing as a library: Kinesis Client Library
• Event-driven cloud functions: AWS Lambda, IBM OpenWhisk, Azure Functions
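An event-driven cloud function is just a handler invoked per batch of stream records. As a sketch, here is what an AWS Lambda handler for a Kinesis trigger looks like: Lambda delivers Kinesis record payloads base64-encoded under `Records[].kinesis.data`. The order-flagging logic is hypothetical, included only to show where business code goes.

```python
import base64
import json

def handler(event, context):
    """Minimal AWS Lambda handler for a Kinesis-triggered micro-service.
    Kinesis payloads arrive base64-encoded under Records[].kinesis.data."""
    processed = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        body = json.loads(payload)
        # Hypothetical business logic: flag high-value orders
        if body.get("amount", 0) > 100:
            processed.append(body["order_id"])
    return {"flagged_orders": processed}

# Local smoke test with a synthetic Kinesis event, no AWS required
fake_event = {"Records": [
    {"kinesis": {"data": base64.b64encode(
        json.dumps({"order_id": "o-1", "amount": 250}).encode()).decode()}},
    {"kinesis": {"data": base64.b64encode(
        json.dumps({"order_id": "o-2", "amount": 40}).encode()).decode()}},
]}
print(handler(fake_event, None))  # {'flagged_orders': ['o-1']}
```

The appeal for async micro-services is that the platform owns the polling, scaling, and retry loop; the service author writes only the per-record transformation.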
36. Greater adoption of open source schema registries as the canonical source of truth for the events in our topologies
Confluent or Iglu schema registries
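The core idea of a schema registry can be sketched in a few lines: events carry a schema name and version, the registry resolves that key to a schema, and consumers validate against it before processing. This is a hand-rolled illustration of the pattern, not the Confluent or Iglu API; the schema names and the required-fields check are hypothetical.

```python
# Toy schema registry: resolve a (name, version) key to a schema, then check
# an event's required fields against it before processing the event.
REGISTRY = {
    ("com.acme/checkout", "1-0-0"): {"required": ["order_id", "amount"]},
    ("com.acme/checkout", "2-0-0"): {"required": ["order_id", "amount", "currency"]},
}

def validate(event):
    """Return (is_valid, message) for a self-describing event."""
    key = (event["schema_name"], event["schema_version"])
    schema = REGISTRY.get(key)
    if schema is None:
        return False, f"unknown schema {key}"
    missing = [f for f in schema["required"] if f not in event["data"]]
    return (not missing), (f"missing fields: {missing}" if missing else "ok")

ok, msg = validate({"schema_name": "com.acme/checkout",
                    "schema_version": "1-0-0",
                    "data": {"order_id": "o-1", "amount": 99}})
print(ok, msg)  # True ok
```

Versioning the schemas in the registry is what enables the upgrade toolkit from slide 33: old and new service versions can run in parallel, each validating against the schema version it understands.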
37. Request/response micro-services aren't going away – they are just too useful
1. But expect a move from slower HTTP-based to faster RPC-based options
• e.g. Google's gRPC
• These are easier to read from and write to in high-volume event-driven architectures
2. Expect wider adoption of API definition languages
• OpenAPI (Swagger), RAML, API Blueprint
3. Eventual harmonization of types?
• Using JSON for RESTful APIs, Protocol Buffers for RPC and Avro for stream processing is crazy
• Needs company-wide standardization, or dynamic translation (but this is lossy)
38. We also need new fabrics – or extensions of existing ones like Kubernetes – to address the challenges of running our topologies
• "How do we monitor this topology, and alert if something (data loss; event lag) is going wrong?"
• "How do we scale our streams and micro-services to handle event peaks and troughs smoothly?"
• "How do we re-configure or upgrade our micro-services without breaking things?"
39. At Snowplow we are working on a unified log fabric,
called Tupilak, to solve this problem
We have a single version of the truth – together, the unified log plus Hadoop archive represent our single version of the truth. They contain exactly the same data - our event stream - they just have different time windows of data
The single version of the truth is upstream from the data warehouse – in the classic era, the data warehouse provided the single version of the truth, making all reports generated from it consistent. In the unified era, the log provides the single version of the truth: as a result, operational systems (e.g. recommendation and ad targeting systems) compute on the same truth as analysts producing management reports
Point-to-point connections have largely been unravelled - in their place, applications can append to the unified log and other applications can read their writes
Local loops have been unbundled - in place of local silos, applications can collaborate on near-real-time decision-making via the unified log
Confluent Schema Registry – an integral part of the Confluent Platform for Kafka-based data pipelines
https://github.com/confluentinc/schema-registry
Iglu – an integral part of the Snowplow open source event data pipeline
https://github.com/snowplow/iglu