The challenge with today’s “data explosion” is finding the most appropriate answer to the question, “So where do I put my data?” while avoiding the longer-term problem: data warehouses, data lakes, cloud storage, NoSQL databases, … are often the places where “big” data goes to die.
Enter Physics 101, and my corollary to Newton’s First Law of Motion:
Data in motion tends to stay in motion until it comes to rest on disk. Similarly, if data is at rest, it will remain at rest until an external “force” puts it in motion again.
Data inevitably comes to rest at some point. Without “external forces”, data often gets lost or becomes stale where it lands. “Modern” architectures tend to involve data pipelines where downstream consumers of data make use of data generated upstream, often with built-for-purpose repositories at each stage. This session will explore how data that has come to rest can be put in motion again; how Kafka can keep it in motion longer; and how pipelined architectures might be created to make use of that data.
3. 3
Modernize and Automate Data Integration
Diagram labels:
• Data Warehouse Automation
• Streaming Data Pipeline Automation
• Managed Data Lake Creation
• Design, Manage & Monitor
• CDC Streaming
• Generate Change Data Streams; Deliver To Clouds, Lakes…; Refine & Merge For Analytics, AI/ML, Data Science…
• Model, Commit, Conform, Consume
• Catalog; Shop, Prepare & Provision
• AI/ML, Analytics, Data Science
• RDBMS, Data Warehouse, Files, Mainframe, SaaS Apps, SAP
• Azure SQL DW, Amazon Redshift
• Amazon RDS, Azure SQL DB, Google Cloud SQL
4. 4
Our Focus for Today: Qlik Replicate & Kafka
Diagram labels:
• Streaming Data Pipeline Automation
• Design, Manage & Monitor
• Generate Change Data Streams; Deliver To Clouds, Lakes…; Refine & Merge For Analytics, AI/ML, Data Science…
• RDBMS, Data Warehouse, Files, Mainframe, SaaS Apps, SAP
7. 7
Newton’s First Law of Motion
Inertia
An object will not change its motion unless acted on by an unbalanced force.
• If it is at rest, it will stay at rest
• If it is in motion, it will remain at the same velocity
Corollary: Objects with greater mass have more inertia. It therefore takes more force to change their motion.
8. 8
Data in motion tends to stay in motion until it comes to rest on disk.
Similarly, if data is at rest, it will remain at rest until an external “force” puts it in motion again.
— John Neal *
* With apologies to Sir Isaac Newton
9. 9
Writing Data to a Database Introduces Friction
Data in Motion
Friction
How do we get the data moving again?
STOP
10. 10
Get Landed Data Moving
Overcoming Storage “Friction”
File I/O (reads)
• Parsing challenges
• No deltas
Database Queries
• Not real-time
• Added database load
Database Triggers
• Added database load
• Doesn’t scale
ETL Tools
• Not real-time
• Added database load
• Getting deltas is hard
Qlik Replicate
• Real-time
• Reads the DB logs
• CDC provides delta processing
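To make the delta-processing idea concrete, here is a minimal Python sketch of applying a stream of change events to a downstream copy of a table. The event shape (`op`, `key`, `row`) and the `apply_changes` helper are hypothetical, chosen only for illustration; they are not Qlik Replicate's actual message format.

```python
from typing import Any, Dict, Iterable

# Hypothetical change-event shape (an assumption, not a real product format):
#   {"op": "insert" | "update" | "delete", "key": <primary key>, "row": <dict or None>}
ChangeEvent = Dict[str, Any]


def apply_changes(target: Dict[Any, dict], events: Iterable[ChangeEvent]) -> Dict[Any, dict]:
    """Apply change events, in order, to an in-memory replica keyed by primary key."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            # Upsert: the event is assumed to carry the full new row image.
            target[key] = dict(event["row"])
        elif op == "delete":
            # Deletes arrive as explicit events, so the replica can drop the row.
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return target


replica = {1: {"name": "Ada", "city": "London"}}
changes = [
    {"op": "insert", "key": 2, "row": {"name": "Isaac", "city": "Woolsthorpe"}},
    {"op": "update", "key": 1, "row": {"name": "Ada", "city": "Paris"}},
    {"op": "delete", "key": 2, "row": None},
]
apply_changes(replica, changes)
print(replica)
```

Only the three changed rows are processed; the replica never needs a full re-read of the source table, which is the practical benefit of working from deltas.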
12. 12
“Modern” Applications Leverage Microservices
• Components are “decoupled” and have well-defined interfaces
- Changes are easier to make because they are localized and isolated
- Results in increased reliability
- Allows for a faster release schedule supporting agile approaches
- Increases opportunity to innovate
• Microservices can use “purpose built” storage rather than a central repository
- Teams are free to choose the most appropriate repository for the problem at hand … a relational database is not always the answer.
• Data flows between components
Microservices
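As a rough illustration of decoupled components with data flowing between them, the following Python sketch has one service publish events to a shared channel and another consume them into its own store. The service names, the event fields, and the use of an in-process `queue.Queue` as a stand-in for a message broker are all assumptions made for this example.

```python
import json
import queue

# Hypothetical in-process channel standing in for a message broker.
events = queue.Queue()


def order_service(order_id: int, amount: float) -> None:
    """Producer: publishes an event and knows nothing about its consumers."""
    events.put(json.dumps({"type": "order_placed", "order_id": order_id, "amount": amount}))


def billing_service(ledger: dict) -> None:
    """Consumer: drains the channel and updates its own purpose-built store."""
    while not events.empty():
        event = json.loads(events.get())
        if event["type"] == "order_placed":
            ledger[event["order_id"]] = event["amount"]


ledger = {}
order_service(101, 25.0)
order_service(102, 40.5)
billing_service(ledger)
print(ledger)
```

The only contract between the two services is the event format, so either side can change its internals or its storage without affecting the other.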
13. 13
Microservice-Based Applications
A Bucket of Bricks
Diagram labels: Data Catalog; Data Warehouse Automation; Media; Data Streaming (CDC); Analytics; Security; Kafka; Streaming Services; Event Processing; RDBMS; Wide-Column Store; Spark / ML; Cloud DW; Hadoop; Key-Value Store; Graph DB (NoSQL); File Storage; Document Store (NoSQL); IoT; Qlik
14. 14
Lambda-Style Architectures
Streaming and batch working together
Diagram labels:
• Incoming Data: NoSQL, IoT, Mobile Apps, Web, Legacy DB/DW
• Streaming (Speed) Layer: Stream Processing (Spark Streaming, Storm, Flink, …); Incremental Views
• Batch Layer: All Data; Pre-Compute Views (Spark, M/R, HQL, …)
• Serving Layer: Real-time Views; Batch Views
• Queries / ML / Analytics
• Data: Ingest & Store, Prepare / Curate, Publish, Consume
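A Lambda-style serving layer answers queries by combining a pre-computed batch view with an incrementally updated real-time view. The Python sketch below shows that merge for a simple page-view count; the event shape and the function names are assumptions made for illustration, and real deployments would use the processing engines named on the slide rather than in-memory counters.

```python
from collections import Counter


def batch_view(all_events):
    """Batch layer: recompute the view from the full history."""
    return Counter(e["page"] for e in all_events)


def update_realtime_view(view, event):
    """Speed layer: incrementally update a view as each event arrives."""
    view[event["page"]] += 1


def serve(batch, realtime):
    """Serving layer: merge batch and real-time views to answer a query."""
    return batch + realtime


# Events already covered by the last batch run.
history = [{"page": "/home"}, {"page": "/home"}, {"page": "/pricing"}]
batch = batch_view(history)

# Events that arrived after the batch run and are only in the speed layer.
realtime = Counter()
for event in [{"page": "/home"}, {"page": "/docs"}]:
    update_realtime_view(realtime, event)

merged = serve(batch, realtime)
print(dict(merged))
```

Note that the same counting logic exists twice, once per layer. That duplication is the usual cost of this style of architecture.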
15. 15
Kappa-Style Architectures
Where everything is a stream
Diagram labels:
• Streaming Data
• Streaming Layer: Stream Processing (Spark Streaming, Storm, Flink, …); Real-time Results
• Serving Layer: Real-time View
• Queries / ML / Analytics
• Mirror events to long term storage
• Storage Layer: Raw Data History
• Re-compute events from storage if needed
• Historical View
• Data: Ingest & Store, Prepare / Curate, Publish, Consume
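In a Kappa-style design there is a single stream-processing code path, and a view can be rebuilt by replaying the stored event history through that same path. A minimal Python sketch follows, assuming an in-memory list as the long-term storage and a hypothetical page-view event shape.

```python
from collections import Counter

# Storage layer: append-only raw event history (an in-memory list for this sketch).
log = []


def process(view, event):
    """The single stream-processing code path, used for live and replayed events."""
    view[event["page"]] += 1


def ingest(view, event):
    """Mirror the event to long-term storage, then process it as part of the stream."""
    log.append(event)
    process(view, event)


def recompute():
    """Rebuild a view from scratch by replaying the stored history."""
    view = Counter()
    for event in log:
        process(view, event)
    return view


live = Counter()
for e in [{"page": "/home"}, {"page": "/docs"}, {"page": "/home"}]:
    ingest(live, e)

rebuilt = recompute()
print(rebuilt == live)
```

Because replay uses `process` unchanged, a bug fix or a new view definition only has to be written once and then re-run over the history.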
23. 23
Summarizing Key Points
• Physics applies to data.
• Qlik Replicate delivers data from databases to Kafka in real-time.
• “Modern” architectures want data to be in motion.
• Kafka is a key component.
• Feedback loops can be a useful way to keep data moving.