Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedreschi, Imply Data) Kafka Summit SF 2019

5 Fabulous OSS Sinks for Kafka
#3 will surprise you!
Rachel Pedreschi
Senior Director, Worldwide Field Engineering and Community
rachel@imply.io

Just for Fun*
https://tinyurl.com/KafkaSummit2019

What are you gonna tell ‘em?
1. How did we get to Kafka? Haven’t we been here before?
2. Five Kafka use cases and their OSS sinks
3. Popular examples of each of these sinks
4. Compare and contrast: polyglot persistence FTW!

Evolution of Streaming Data
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

Evolution of Streaming Data
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

Death of OLAP vs OLTP
https://www.slideshare.net/KaiWaehner/apache-kafka-vs-integration-middleware-mq-etl-esb
ESB
MQ
Uptime
Transactions
ACID
Applications
Speed
Short Request
ETL
Reporting
Analytics
Cubes
Batch
Long Request
OLTP
OLAP
Scalable

It's Just Data Now
ESB
MQ
Uptime
Transactions
ACID
Applications
Speed
Short Request
ETL
Reporting
Analytics
Cubes
Batch
Long Request
PP

It's Just Data Now
Uptime
Transactions
ACID
Applications
Speed
Short Request
Reporting
Analytics
Cubes
Batch
Long Request
PP

Top 5 Use Cases for Kafka…
https://kafka.apache.org/uses

…And their 5 Fabulous OSS Sinks

#1 Kafka as a Message Broker and Event
Source for Always On applications with
Apache Cassandra

Real Time Message Delivery and Event Sourcing
https://www.confluent.io/blog/kafka-connect-cassandra-sink-the-perfect-match/
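A minimal sketch of the event sourcing idea, in plain Python rather than Kafka or Cassandra client code: current state is never stored directly, it is derived by replaying an ordered log of events. The event shape `(kind, key, value)` and the names `apply_event` and `replay` are hypothetical, chosen only for illustration.

```python
def apply_event(state, event):
    """Return a new state with one event applied (the old state is not mutated)."""
    kind, key, value = event
    new_state = dict(state)
    if kind == "set":
        new_state[key] = value
    elif kind == "delete":
        new_state.pop(key, None)
    return new_state


def replay(events, initial=None):
    """Rebuild state by applying events in log order."""
    state = dict(initial or {})
    for event in events:
        state = apply_event(state, event)
    return state


# Hypothetical event log, in the order it would be read from a topic partition.
log = [
    ("set", "cart:1", "book"),
    ("set", "cart:1", "pen"),
    ("delete", "cart:1", None),
    ("set", "cart:2", "mug"),
]
print(replay(log))
```

In this picture the sink holds the materialized result of the replay, while the log remains the source of truth.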

#2 Metric collection for realtime
monitoring with InfluxDB

What is InfluxDB?
https://docs.influxdata.com/influxdb/v1.7/

• Use case: Ensure quality of streaming services
• Data set: They have more than 30 front-facing applications and
portals (set-top boxes, etc.). To ensure the quality of service of
these applications, data is collected from the sources and
surfaced to an internal, customized real-time dashboard.
Metrics tracked include day-over-day quality,
content analytics, and bitrates by geographic region.
• Business goal: Viewing real-time data on how their
services are behaving allows their operations teams to take
action to maintain a high quality of service for their users.
Metric Collection
https://speakerdeck.com/implydatainc/druid-at-charter
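As a rough illustration of what one such metric could look like on its way into InfluxDB, the sketch below formats a data point in InfluxDB line protocol (`measurement,tag_set field_set timestamp`). The measurement, tag, and field names are assumptions made up for this example, not taken from the talk, and the helper functions are hypothetical rather than part of any InfluxDB client library.

```python
def escape_tag(text):
    """Escape the characters line protocol treats as delimiters in tag/field keys and tag values."""
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def format_field(value):
    """Render a field value: booleans, integers (suffix i), floats, or quoted strings."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    return '"' + str(value).replace('"', '\\"') + '"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line: measurement,tag=value,... field=value,... timestamp."""
    tag_part = "".join(
        f",{escape_tag(k)}={escape_tag(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{escape_tag(k)}={format_field(v)}" for k, v in fields.items()
    )
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical metric emitted by a set-top box.
line = to_line_protocol(
    "stream_quality",
    {"region": "us-west", "device": "set-top box"},
    {"bitrate_kbps": 4200, "buffering": False},
    1569888000000000000,
)
print(line)
```

Tags are indexed and suit the dimensions you filter or group by (region, device), while fields carry the measured values.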

Metrics and Event Sourcing

Box emitting
metrics
https://speakerdeck.com/implydatainc/druid-at-charter

#3 Stream Processing with Apache
Spark

What is Spark Streaming?
https://spark.apache.org/docs/latest/streaming-programming-guide.html

• Use case: Collect, analyze, and diagnose network flow data for
visualizing traffic among all possible pairs of sources and
destinations in a given internet domain
• Data set: Enriched network flows (application and network
behavior)
• Goals:
○ Provide real-time visibility for capacity planning, traffic
engineering, resource optimization, revenue leak detection
○ Proactively identify and rapidly resolve issues
Stream Processing
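The per-pair traffic view described in this use case boils down to a windowed aggregation keyed by source and destination. The sketch below shows that idea in plain Python as a stand-in; it does not use the Spark Streaming API, and the record shape `(timestamp_seconds, source, destination, bytes)` and the function names are assumptions for illustration.

```python
from collections import defaultdict


def window_start(timestamp, window_seconds):
    """Align a timestamp to the start of its tumbling window."""
    return timestamp - (timestamp % window_seconds)


def aggregate_flows(flows, window_seconds=60):
    """Sum bytes per (window, source, destination) over a batch of flow records."""
    totals = defaultdict(int)
    for timestamp, source, destination, num_bytes in flows:
        key = (window_start(timestamp, window_seconds), source, destination)
        totals[key] += num_bytes
    return dict(totals)


# Hypothetical flow records: (timestamp in seconds, source, destination, bytes).
flows = [
    (0, "10.0.0.1", "10.0.0.2", 100),
    (30, "10.0.0.1", "10.0.0.2", 50),
    (61, "10.0.0.1", "10.0.0.2", 10),
    (10, "10.0.0.2", "10.0.0.1", 5),
]
print(aggregate_flows(flows))
```

A stream processor applies the same grouping continuously and distributes it across workers; the output is what would then be pushed to a sink for visualization.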

Spark Streaming vs Kafka Streams

https://www.cuelogic.com/blog/analyzing-data-streaming-using-spark-vs-kafka


#4 Log Aggregation for real time search
with Elasticsearch

What is Elasticsearch?
Elasticsearch is an open-source, RESTful, distributed search and analytics engine built on
Apache Lucene. Since its release in 2010, Elasticsearch has quickly become the most popular
search engine, and is commonly used for log analytics, full-text search, security
intelligence, business analytics, and operational intelligence use cases.

Log Analytics
https://hackernoon.com/distributed-log-analytics-using-apache-kafka-kafka-connect-and-fluentd-303330e478af
“Logs… are the heartbeats of our tech stack. They give us insight
into how users interact with us. They provide real time application
intelligence. For that reason we built a robust set of data
infrastructure that can handle large volume of logs from all our
applications, and allow for real time search as well as batch
processing.”
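To make the "real time search" half of that quote concrete, here is a small sketch of turning raw log lines into documents and packing them into the newline-delimited JSON body that Elasticsearch's `_bulk` endpoint expects (an action line followed by the document source, with a trailing newline). The log format, index name, and field names are assumptions for illustration; in a pipeline like the one described, a Kafka Connect sink would normally do this work for you.

```python
import json


def parse_log_line(line):
    """Split a hypothetical 'timestamp level service message' log line into a document."""
    timestamp, level, service, message = line.split(" ", 3)
    return {
        "@timestamp": timestamp,
        "level": level,
        "service": service,
        "message": message,
    }


def to_bulk_body(index, docs):
    """Build an NDJSON bulk body: one action line plus one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


raw = ["2019-10-01T12:00:00Z ERROR checkout payment timeout"]
body = to_bulk_body("app-logs", [parse_log_line(line) for line in raw])
print(body)
```

Structured fields such as `level` and `service` can then be filtered exactly, while `message` stays available for full-text search.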

Log Analytics
https://hackernoon.com/distributed-log-analytics-using-apache-kafka-kafka-connect-and-fluentd-303330e478af

#5 Website Activity Tracking with Apache
Druid

What is Druid?
high performance
analytics data store for
event-driven data

Search platform
• Real-time ingestion
• Flexible schema
• Full text search
OLAP
• Batch ingestion
• Efficient storage
• Fast analytic queries
Timeseries database
• Optimized storage for time-based datasets
• Time-based functions

Data warehouses
Tightly coupled architecture with limited flexibility.
Data
Data
Data
Data Sources
ETL Data
warehouse
Processing Store and Compute
Analytics
Reporting
Data mining
Querying

Data lakes
Modern data architectures are more application-centric.
Data
Data
Data
Data Sources
MapReduce, Spark Apps
ETL
SQL
ML/AI
TSDB
Data
lake
Storage

Data rivers
Streaming architectures are true-to-life and enable faster decision cycles.
Data
Data
Data
Data Sources
Stream processors
Stream
hub
Streaming
analytics
Databases
ETL
Storage
Apps
Archive to
data lake

Website Activity Monitoring
https://imply.io/post/clickstream-analysis-open-source-divolte-kafka-druid
• Is our campaign working, right now?
• Are we getting more visitors today than we did this time last week?
• Is now a good time to publish content changes targeted at a particular geography?
• Should we target adverts to a different referring website today?
• Have yesterday’s SSO changes made the impact we’ve been looking for?
• How has the release of a new browser this week affected our customer profile? Do we need
to adapt our website code?
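Questions like "are we getting more visitors today than we did this time last week?" reduce to comparing a distinct count over two time windows. The sketch below shows that comparison in plain Python over an in-memory list of click events; it is a stand-in for the kind of query an analytics store would answer, not Druid's API, and the event shape `(timestamp, visitor_id)` and function names are assumptions.

```python
from datetime import datetime, timedelta


def visitors_in_window(events, start, end):
    """Count distinct visitors with at least one event in [start, end)."""
    return len({visitor for timestamp, visitor in events if start <= timestamp < end})


def week_over_week(events, now, window=timedelta(hours=1)):
    """Return (visitors in the last window, visitors in the same window a week ago)."""
    week = timedelta(days=7)
    current = visitors_in_window(events, now - window, now)
    previous = visitors_in_window(events, now - window - week, now - week)
    return current, previous


# Hypothetical clickstream events: (timestamp, visitor id).
now = datetime(2019, 10, 8, 12, 0)
events = [
    (datetime(2019, 10, 8, 11, 30), "u1"),
    (datetime(2019, 10, 8, 11, 45), "u2"),
    (datetime(2019, 10, 8, 11, 50), "u1"),
    (datetime(2019, 10, 1, 11, 40), "u3"),
    (datetime(2019, 10, 1, 9, 0), "u4"),
]
print(week_over_week(events, now))
```

The point of putting this behind a fast analytics sink is that the same comparison can be re-sliced by campaign, geography, referrer, or browser without precomputing each view.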

Website Activity Monitoring
https://imply.io/post/clickstream-analysis-open-source-divolte-kafka-druid

Druid vs Cassandra
https://imply.io/post/apache-cassandra-vs-apache-druid
Always On, Fast, Scalable Applications on single partition reads
vs
Low latency OLAP and AdHoc queries over entire datasets
Need seamless multi data center replication?
Are query patterns adhoc or unknown?
Need to do full table scans?
Have more writes than reads and need millisecond response time?

InfluxDB vs Druid
https://imply.io/post/apache-druid-vs-time-series-databases
Fast timeseries reads and writes
vs
distributed OLAP-style analytics
Simple aggregations / counters?
Group on non-time-based tags or attributes?
Slice and dice on your metrics arbitrarily?
Need a single node or have a small amount of data?

Spark Streaming vs Druid
Stream processing
vs
fast SQL queries on historical and real time data
Need the full power of Scala to do transformations?
Want to query real time and historical data?
Looking for tiered storage and automatic backups?
Want to push results to other systems directly?

Druid vs Elasticsearch
Search vs Analytics
Complex AdHoc Queries?
Text prediction?
Aggregation at ingestion?
Totally unstructured data?

Final Thoughts
Photo taken by me on my iPhone @Candytopia SF

Stay in touch
@druidio
Join the community!
http://druid.apache.org/
rachel@imply.io
Follow the Druid project on Twitter!
