Apache Kafka has become the modern central point for a fast and scalable streaming platform. Thanks to the open source explosion over the last decade, there are now numerous data stores available as sinks for Kafka-brokered data, from search to document stores, columnar DBs, time series DBs and more. While many claim to be the Swiss Army knife, in reality each is designed for specific types of data and analytics approaches. In this talk, we will cover the taxonomy of various data sinks and delve into each category's pros, cons and ideal use cases, so you can select the right ones and tie them together with Kafka into a well-considered architecture.
5. What are you gonna tell ’em?
1. How did we get to Kafka? Haven’t we been here before?
2. Five Kafka use cases and their OSS sinks.
3. Popular examples of each of these sinks.
4. Compare and Contrast: Polyglot Persistence FTW!
6. Evolution of Streaming Data
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
7. Evolution of Streaming Data
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
8. Death of OLAP vs OLTP
https://www.slideshare.net/KaiWaehner/apache-kafka-vs-integration-middleware-mq-etl-esb
[Diagram labels: ESB, MQ, Uptime, Transactions, ACID, Applications, Speed, Short Request | ETL, Reporting, Analytics, Cubes, Batch, Long Request | OLTP, OLAP, Scalable]
9. It's Just Data Now
https://www.slideshare.net/KaiWaehner/apache-kafka-vs-integration-middleware-mq-etl-esb
[Diagram labels: ESB, MQ, Uptime, Transactions, ACID, Applications, Speed, Short Request | ETL, Reporting, Analytics, Cubes, Batch, Long Request | PP]
10. It's Just Data Now
https://www.slideshare.net/KaiWaehner/apache-kafka-vs-integration-middleware-mq-etl-esb
[Diagram labels: Uptime, Transactions, ACID, Applications, Speed, Short Request | Reporting, Analytics, Cubes, Batch, Long Request | PP]
11. Top 5 Use Cases for Kafka…
https://kafka.apache.org/uses
31. Metric Collection
• Use case: Ensure Quality of Streaming Services
• Data set: They have more than 30 front-facing applications and portals (set-top boxes, etc.). To ensure the quality of service of these applications, data is collected from the sources and surfaced to an internal customized real-time user dashboard interface. Metrics tracked include day-over-day quality, content analytics, and bitrates by geographic region.
• Business Goal: Being able to view real-time data on how their services are behaving allows their operations teams to take action to maintain a high quality of service for their users.
https://speakerdeck.com/implydatainc/druid-at-charter
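As a rough illustration of the metrics named on this slide, the sketch below averages bitrate by geographic region and day, then compares one day with the previous one. It is plain Python over an in-memory list, not the pipeline described in the talk; the event fields (`day`, `region`, `bitrate_kbps`), the sample values, and both function names are assumptions made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical playback events; field names and values are illustrative only.
events = [
    {"day": "2019-05-01", "region": "east", "bitrate_kbps": 4200},
    {"day": "2019-05-01", "region": "east", "bitrate_kbps": 3800},
    {"day": "2019-05-01", "region": "west", "bitrate_kbps": 5000},
    {"day": "2019-05-02", "region": "east", "bitrate_kbps": 3000},
    {"day": "2019-05-02", "region": "west", "bitrate_kbps": 5200},
]


def avg_bitrate_by_region_and_day(events):
    """Group events by (day, region) and average the bitrate in each group."""
    buckets = defaultdict(list)
    for event in events:
        buckets[(event["day"], event["region"])].append(event["bitrate_kbps"])
    return {key: mean(values) for key, values in buckets.items()}


def day_over_day_change(rollup, region, prev_day, day):
    """Difference in average bitrate between two days for one region."""
    return rollup[(day, region)] - rollup[(prev_day, region)]


rollup = avg_bitrate_by_region_and_day(events)
east_change = day_over_day_change(rollup, "east", "2019-05-01", "2019-05-02")
print(rollup)
print("east day-over-day change (kbps):", east_change)
```

In a real deployment this kind of rollup would typically be computed by the sink or a stream processor rather than in application code; the sketch only shows the shape of the calculation.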
32. Metrics and Event Sourcing
Box emitting metrics
https://speakerdeck.com/implydatainc/druid-at-charter
34. What is Spark Streaming?
https://spark.apache.org/docs/latest/streaming-programming-guide.html
35. Stream Processing
• Use case: Collect, analyze, and diagnose network flow data for visualizing traffic among all possible pairs of sources and destinations in a given internet domain
• Data set: Enriched network flows (application and network behavior)
• Goals:
○ Provide real-time visibility for capacity planning, traffic engineering, resource optimization, revenue leak detection
○ Proactively identify and rapidly resolve issues
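To make the "traffic among all pairs of sources and destinations" idea concrete, here is a minimal sketch that totals bytes per (source, destination) pair in fixed, non-overlapping time windows. It is plain Python rather than Spark Streaming code, and the record layout, the 60-second window size, and the function name are all assumptions chosen for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed window size, not taken from the talk

# Hypothetical flow records: (epoch_seconds, source, destination, bytes)
flows = [
    (0, "10.0.0.1", "10.0.0.2", 500),
    (30, "10.0.0.1", "10.0.0.2", 700),
    (45, "10.0.0.3", "10.0.0.2", 100),
    (61, "10.0.0.1", "10.0.0.2", 900),
]


def bytes_per_pair_per_window(flows, window_seconds=WINDOW_SECONDS):
    """Sum bytes for each (window_start, source, destination) key."""
    totals = defaultdict(int)
    for ts, src, dst, nbytes in flows:
        # Align the timestamp to the start of its fixed window.
        window_start = ts - (ts % window_seconds)
        totals[(window_start, src, dst)] += nbytes
    return dict(totals)


totals = bytes_per_pair_per_window(flows)
for key in sorted(totals):
    print(key, totals[key])
```

A stream processor would maintain the same kind of keyed, windowed state incrementally as records arrive, instead of looping over a finished list.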
41. What is Elasticsearch?
Elasticsearch is an open-source, RESTful, distributed search and analytics engine built on
Apache Lucene. Since its release in 2010, Elasticsearch has quickly become the most popular
search engine, and is commonly used for log analytics, full-text search, security
intelligence, business analytics, and operational intelligence use cases.
48. Confidential. Do not redistribute.
Search platform
• Real-time ingestion
• Flexible schema
• Full text search
OLAP
• Batch ingestion
• Efficient storage
• Fast analytic queries
Timeseries database
• Optimized storage for time-based datasets
• Time-based functions
49. Data warehouses
Tightly coupled architecture with limited flexibility.
[Diagram labels: Data Sources, ETL, Data warehouse, Processing, Store and Compute, Analytics, Reporting, Data mining, Querying]
50. Data lakes
Modern data architectures are more application-centric.
[Diagram labels: Data Sources; MapReduce, Spark Apps; ETL; SQL; ML/AI; TSDB; Data lake; Storage]
51. Data rivers
Streaming architectures are true-to-life and enable faster decision cycles.
[Diagram labels: Data Sources, Stream processors, Stream hub, Streaming analytics, Databases, ETL, Storage, Apps, Archive to data lake]
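One way to picture the stream hub feeding several sinks is an append-only log that each consumer reads at its own pace. The sketch below is an in-memory stand-in written for illustration, not a Kafka client: the `StreamHub` and `Sink` classes, their methods, and the sink names are hypothetical.

```python
class StreamHub:
    """In-memory stand-in for a stream hub: an append-only log of events."""

    def __init__(self):
        self.log = []

    def publish(self, event):
        self.log.append(event)

    def read_from(self, offset):
        """Return every event at or after the given position in the log."""
        return self.log[offset:]


class Sink:
    """A downstream consumer that tracks its own read position."""

    def __init__(self, name, accepts):
        self.name = name
        self.accepts = accepts  # predicate selecting the events this sink keeps
        self.offset = 0
        self.stored = []

    def poll(self, hub):
        batch = hub.read_from(self.offset)
        self.offset += len(batch)
        self.stored.extend(e for e in batch if self.accepts(e))


hub = StreamHub()
analytics = Sink("streaming-analytics", lambda e: e["type"] == "metric")
archive = Sink("data-lake-archive", lambda e: True)

hub.publish({"type": "metric", "value": 1})
hub.publish({"type": "log", "value": "started"})
analytics.poll(hub)  # analytics reads early
hub.publish({"type": "metric", "value": 2})
analytics.poll(hub)  # and picks up only what is new
archive.poll(hub)    # archive reads the whole log later, in one pass

print(len(analytics.stored), len(archive.stored))
```

Because each sink keeps its own offset, a slow archive job does not hold back the analytics path, which is the property that lets different stores sit side by side behind the same hub.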
52. Website Activity Monitoring
https://imply.io/post/clickstream-analysis-open-source-divolte-kafka-druid
• Is our campaign working, right now?
• Are we getting more visitors today than we did this time last week?
• Is now a good time to publish content changes targeted at a particular geography?
• Should we target adverts to a different referring website today?
• Have yesterday’s SSO changes made the impact we’ve been looking for?
• How has the release of a new browser this week affected our customer profile? Do we need to adapt our website code?
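A question such as "Are we getting more visitors today than we did this time last week?" reduces to comparing unique visitor counts over two matching time ranges. The sketch below shows that comparison in plain Python; the clickstream tuples, the one-hour range, and the function names are assumptions for illustration, and in practice the query would run against the analytics store rather than a list.

```python
from datetime import datetime, timedelta

# Hypothetical clickstream: (timestamp, visitor_id)
clicks = [
    (datetime(2019, 5, 1, 9, 15), "a"),
    (datetime(2019, 5, 1, 9, 40), "b"),
    (datetime(2019, 5, 8, 9, 5), "a"),
    (datetime(2019, 5, 8, 9, 20), "c"),
    (datetime(2019, 5, 8, 9, 50), "d"),
    (datetime(2019, 5, 8, 11, 0), "e"),  # falls outside the compared hour
]


def unique_visitors(clicks, start, end):
    """Count distinct visitors with a click in the half-open range [start, end)."""
    return len({visitor for ts, visitor in clicks if start <= ts < end})


def week_over_week(clicks, hour_start):
    """Return (visitors this hour, visitors in the same hour one week earlier)."""
    hour = timedelta(hours=1)
    week = timedelta(weeks=1)
    current = unique_visitors(clicks, hour_start, hour_start + hour)
    previous = unique_visitors(clicks, hour_start - week, hour_start - week + hour)
    return current, previous


this_week, last_week = week_over_week(clicks, datetime(2019, 5, 8, 9, 0))
print("this week:", this_week, "last week:", last_week)
```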
56. Druid vs Cassandra
https://imply.io/post/apache-cassandra-vs-apache-druid
Always-on, fast, scalable applications on single-partition reads
vs
Low-latency OLAP and ad hoc queries over entire datasets
Need seamless multi-data-center replication?
Are query patterns ad hoc or unknown?
Need to do full table scans?
Have more writes than reads and need millisecond response time?
58. Spark Streaming vs Druid
Stream processing
vs
fast SQL queries on historical and real time data
Need the full power of Scala to do transformations?
Want to query real time and historical data?
Looking for tiered storage and automatic backups?
Want to push results to other systems directly?
59. Druid vs Elasticsearch
Search vs Analytics
Complex ad hoc queries?
Text prediction?
Aggregation at ingestion?
Totally unstructured data?