Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
1. Streaming Analytics with Spark,
Kafka, Cassandra, and Akka
Helena Edelson
VP of Product Engineering @Tuplejump
2. • Committer / Contributor: Akka, FiloDB, Spark Cassandra
Connector, Spring Integration
• VP of Product Engineering @Tuplejump
• Previously: Sr Cloud Engineer / Architect at VMware,
CrowdStrike, DataStax and SpringSource
Who
@helenaedelson
github.com/helena
3. Tuplejump Open Source
github.com/tuplejump
• FiloDB - distributed, versioned, columnar analytical db for modern
streaming workloads
• Calliope - the first Spark-Cassandra integration
• Stargate - Lucene indexer for Cassandra
• SnackFS - HDFS-compatible file system for Cassandra
4. What Will We Talk About
• The Problem Domain
• Example Use Case
• Rethinking Architecture
– We don't have to look far to look back
– Streaming
– Revisiting the goal and the stack
– Simplification
6. The Problem Domain
Need to build scalable, fault-tolerant, distributed data
processing systems that can handle massive amounts of
data from disparate sources, with different data structures.
7. Translation
How to build adaptable, elegant systems
for complex analytics and learning tasks
to run as large-scale clustered dataflows
8. How Much Data
We all have a lot of data
• Terabytes
• Petabytes...
Yottabyte = a quadrillion gigabytes, or a septillion bytes
https://en.wikipedia.org/wiki/Yottabyte
9. Delivering Meaning
• Deliver meaning at second / sub-second latency
• Disparate data sources & schemas
• Billions of events per second
• High-latency batch processing
• Low-latency stream processing
• Aggregation of historical data from the stream
10. While We Monitor, Predict & Proactively Handle
• Massive event spikes
• Bursty traffic
• Fast producers / slow consumers
• Network partitioning & Out of sync systems
• DC down
• Wait, have we DDoS'd ourselves with our own fast streams?
• Autoscale issues
– When we scale down VMs, how do we not lose data?
13. Adversary Profiling & Hunting
• Track activities of international threat actor groups:
nation-state, criminal or hacktivist
• Intrusion attempts
• Actual breaches
• Profile adversary activity
• Analysis to understand their motives, anticipate their actions
and prevent damage
14. Stream Processing
• Machine events
• Endpoint intrusion detection
• Anomalies / indicators of attack or compromise
• Machine learning
• Training models on patterns from historical data
• Predicting potential threats
• Profiling for adversary identification
15. Data Requirements & Description
• Streaming event data
• Log messages
• User activity records
• System ops & metrics data
• Disparate data sources
• Wildly differing data structures
16. Massive Amounts Of Data
• One machine can generate 2+ TB per day
• Tracking millions of devices
• 1 million writes per second - bursty
• High % writes, lower % reads
• TTL on expiring data (see the write sketch below)
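A minimal sketch of what TTL'd, write-heavy ingestion can look like with the Spark Cassandra Connector; `events` stands in for a hypothetical RDD of ingested records, and the keyspace, table, and column names are made up for illustration:

import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.{TTLOption, WriteConf}

// Write raw events with a 24-hour TTL so Cassandra expires them
// automatically - no cleanup job needed for high-volume machine data.
events.saveToCassandra("sensor_ks", "raw_events",
  SomeColumns("wsid", "event_time", "payload"),
  writeConf = WriteConf(ttl = TTLOption.constant(60 * 60 * 24)))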
21. Streaming
I need fast access to historical data on the fly for
predictive modeling with real time data from the stream.
22. Not A Stream, A Flood
• Data emitters
• Netflix: 1 - 2 million events per second at peak
• 750 billion events per day
• LinkedIn: > 500 billion events per day
• Data ingesters
• Netflix: 50 - 100 billion events per day
• LinkedIn: 2.5 trillion events per day
• 1 Petabyte of streaming data
26. Strategies
• Partition For Scale & Data Locality
• Replicate For Resiliency
• Share Nothing
• Fault Tolerance
• Asynchrony
• Async Message Passing
• Memory Management
• Data lineage and reprocessing at runtime
• Parallelism
• Elastically Scale
• Isolation
• Location Transparency
27. AND THEN WE GREEKED OUT
Rethinking Architecture
28. Lambda Architecture
A data-processing architecture designed to handle massive
quantities of data by taking advantage of both batch and
stream processing methods.
29. Lambda Architecture
• An approach
• Coined by Nathan Marz
• This was a huge stride forward
32. Implementing Is Hard
• Real-time pipeline backed by KV store for updates
• Many moving parts - KV store, real time, batch
• Running similar code in two places
• Still ingesting data to Parquet/HDFS
• Reconcile queries against two different places
36. Which Translates To
• Performing analytical computations & queries in dual
systems
• Implementing transformation logic twice
• Duplicate Code
• Spaghetti Architecture for Data Flows
• One Busy Network
37. Why Dual Systems?
• Why is a separate batch system needed?
• Why support code, machines and running services of
two analytics systems?
Counterproductive on some level?
38. YES
• A unified system for streaming and batch
• Real-time processing and reprocessing
• Code changes
• Fault tolerance
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html - Jay Kreps
40. Extract, Transform, Load (ETL)
"Designing and maintaining the ETL process is often considered
one of the most difficult and resource-intensive portions of a
data warehouse project."
http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm
41. Extract, Transform, Load (ETL)
ETL involves
• Extracting data from one system
• Transforming it
• Loading it into another system
42. Extract, Transform, Load (ETL)
"Designing and maintaining the ETL process is often
considered one of the most difficult and resource-
intensive portions of a data warehouse project."
http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm
42
Also unnecessarily redundant and often typeless
43. ETL
• Each ETL step can introduce errors and risk
• Can duplicate data after failover
• Tools can cost millions of dollars
• Decreases throughput
• Increases complexity
48. Removing The 'E' in ETL
Thanks to technologies like Avro and Protobuf we don’t need the
“E” in ETL. Instead of text dumps that you need to parse over
multiple systems:
Scala & Avro (e.g.)
• Can work with binary data that remains strongly typed
• A return to strong typing in the big data ecosystem
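For instance, here is a minimal sketch of round-tripping a strongly typed record through Avro's binary encoding; the Event schema and its fields are hypothetical, invented for illustration:

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// Hypothetical event schema; in practice shared across systems.
val schema: Schema = new Schema.Parser().parse(
  """{"type":"record","name":"Event","fields":[
    |  {"name":"id","type":"int"},
    |  {"name":"msg","type":"string"}]}""".stripMargin)

val record = new GenericData.Record(schema)
record.put("id", 1)
record.put("msg", "endpoint.intrusion.detected")

// Encode to compact binary - no text dump for downstream systems to parse.
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get.binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
encoder.flush()

// Any consumer holding the schema decodes back to typed fields.
val decoder = DecoderFactory.get.binaryDecoder(out.toByteArray, null)
val decoded = new GenericDatumReader[GenericRecord](schema).read(null, decoder)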
49. Removing The 'L' in ETL
If data collection is backed by a distributed messaging
system (e.g. Kafka) you can do real-time fanout of the
ingested data to all consumers. No need to batch "load".
• From there each consumer can do their own transformations
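As a rough sketch of that fanout, using the newer Kafka consumer API with a hypothetical local broker and ingest.raw topic: two consumer groups independently receive every record, so each applies its own transformation and there is no batch "load" step anywhere.

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

def consumerFor(groupId: String): KafkaConsumer[String, String] = {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("group.id", groupId)
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(Collections.singletonList("ingest.raw"))
  consumer
}

// Kafka fans the same stream out to every group in real time.
val analytics = consumerFor("analytics-group") // runs its own transformations
val archiver  = consumerFor("archive-group")   // persists raw data unchanged

val records = analytics.poll(1000) // each group sees every record on the topic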
55. Spark Streaming
• One runtime for streaming and batch processing
• Join streaming and static data sets
• No code duplication
• Easy, flexible data ingestion from disparate sources to
disparate sinks
• Easy to reconcile queries against multiple sources
• Easy integration of KV durable storage
56. How do I merge historical data with data
in the stream?
57. Join Streams With Static Data
val ssc = new StreamingContext(conf, Milliseconds(500))
ssc.checkpoint("checkpoint")

val staticData: RDD[(Int, String)] =
  ssc.sparkContext.textFile("whyAreWeParsingFiles.txt").flatMap(func)

val stream: DStream[(Int, (String, String))] =
  KafkaUtils.createStream(ssc, zkQuorum, group, Map(topic -> n))
    .transform { events => events.join(staticData) }

stream.saveToCassandra(keyspace, table)
ssc.start()
59. Spark Streaming & ML
val ssc = new StreamingContext(conf, Milliseconds(500))
val model = KMeans.train(dataset, ...) // learn offline

val stream = KafkaUtils
  .createStream(ssc, zkQuorum, group, ...)
  .map(event => model.predict(event.feature)) // score each event online
60. Apache Mesos
Open-source cluster manager developed at UC Berkeley.
Abstracts CPU, memory, storage, and other compute resources
away from machines (physical or virtual), enabling fault-tolerant
and elastic distributed systems to be built easily and run
effectively.
61. Akka
High-performance concurrency framework for Scala and Java
• Fault Tolerance
• Asynchronous messaging and data processing
• Parallelization
• Location Transparency
• Local / Remote Routing
• Akka: Cluster / Persistence / Streams
62. Akka Actors
A distribution and concurrency abstraction
• Compute Isolation
• Behavioral Context Switching
• No Exposed Internal State
• Event-based messaging
• Easy parallelism
• Configurable fault tolerance
64.
import akka.actor._

class NodeGuardianActor(args...) extends Actor with SupervisorStrategy {

  val temperature = context.actorOf(
    Props(new TemperatureActor(args)), "temperature")

  val precipitation = context.actorOf(
    Props(new PrecipitationActor(args)), "precipitation")

  override def preStart(): Unit = { /* lifecycle hook: init */ }

  def receive: Actor.Receive = {
    case Initialized => context become initialized
  }

  def initialized: Actor.Receive = {
    case e: SomeEvent => someFunc(e)
    case e: OtherEvent => otherFunc(e)
  }
}
65. Apache Cassandra
• Extremely Fast
• Extremely Scalable
• Multi-Region / Multi-Datacenter
• Always On
• No single point of failure
• Survive regional outages
• Easy to operate
• Automatic & configurable replication
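As an illustrative sketch, replication is declared once per keyspace and Cassandra handles the rest; the keyspace and datacenter names here are hypothetical, and `conf` is the application's SparkConf:

import com.datastax.spark.connector.cql.CassandraConnector

// Replicate every row 3 ways in each datacenter: survives a regional outage.
CassandraConnector(conf).withSessionDo { session =>
  session.execute(
    """CREATE KEYSPACE IF NOT EXISTS sensor_ks WITH replication = {
      |  'class': 'NetworkTopologyStrategy', 'us_east': 3, 'eu_west': 3 }""".stripMargin)
}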
66. Apache Cassandra
• Very flexible data modeling (collections, user defined
types) and changeable over time
• Perfect for ingestion of real time / machine data
• Huge community
67. Spark Cassandra Connector
• NOSQL JOINS!
• Write & Read data between Spark and Cassandra
• Compatible with Spark 1.4
• Handles Data Locality for Speed
• Implicit type conversions
• Server-Side Filtering - SELECT, WHERE, etc.
• Natural Timeseries Integration
http://github.com/datastax/spark-cassandra-connector
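A small sketch of the connector API in action; the keyspace, table, and column names are hypothetical:

import com.datastax.spark.connector._

// Server-side projection and filtering: only matching rows leave Cassandra.
val precip = sc.cassandraTable("sensor_ks", "daily_precip")
  .select("wsid", "precipitation")
  .where("wsid = ?", "725030:14732")

// Write results straight back to another table, with data locality.
precip.map(row => (row.getString("wsid"), row.getDouble("precipitation")))
  .saveToCassandra("sensor_ks", "precip_rollups", SomeColumns("wsid", "precipitation"))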
68. KillrWeather
http://github.com/killrweather/killrweather
A reference application showing how to easily integrate streaming and
batch data processing with Apache Spark Streaming, Apache
Cassandra, Apache Kafka and Akka for fast, streaming computations
on time series data in asynchronous event-driven environments.
http://github.com/databricks/reference-apps/tree/master/timeseries/scala/timeseries-weather/src/main/scala/com/databricks/apps/weather
69. Apache Kafka
• High Throughput Distributed Messaging
• Decouples Data Pipelines
• Handles Massive Data Load
• Supports a Massive Number of Consumers
• Distribution & partitioning across cluster nodes
• Automatic recovery from broker failures
70. Spark Streaming & Kafka
val context = new StreamingContext(conf, Seconds(1))

val wordCount = KafkaUtils.createStream(context, ...)
  .map { case (_, message) => message } // keep the message value
  .flatMap(_.split(" "))
  .map(x => (x, 1))
  .reduceByKey(_ + _)

wordCount.saveToCassandra(ks, table)
context.start() // start receiving and computing
71.
class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext, settings: Settings)
  extends AggregationActor {
  import settings._

  val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
    .map(_._2.split(","))
    .map(RawWeatherData(_))

  kafkaStream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

  /** RawWeatherData: wsid, year, month, day, oneHourPrecip */
  kafkaStream.map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
    .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)

  /** Now the [[StreamingContext]] can be started. */
  context.parent ! OutputStreamInitialized

  def receive: Actor.Receive = {…}
}

Saving the raw stream writes by partition key, which the Spark Cassandra Connector feeds to Spark for data locality. The daily-precipitation table uses a Cassandra counter column, so no expensive `reduceByKey` is needed: simply let C* aggregate on write, which is cheap and fast.
72.
/** For a given weather station, calculates annual cumulative precip - or year to date. */
class PrecipitationActor(ssc: StreamingContext, settings: WeatherSettings) extends AggregationActor {
  import akka.pattern.pipe
  import context.dispatcher // ExecutionContext for mapping over the async result

  def receive: Actor.Receive = {
    case GetPrecipitation(wsid, year)        => cumulative(wsid, year, sender)
    case GetTopKPrecipitation(wsid, year, k) => topK(wsid, year, k, sender)
  }

  /** Computes annual aggregation. Precipitation values are 1-hour deltas from the previous. */
  def cumulative(wsid: String, year: Int, requester: ActorRef): Unit =
    ssc.cassandraTable[Double](keyspace, dailytable)
      .select("precipitation")
      .where("wsid = ? AND year = ?", wsid, year)
      .collectAsync()
      .map(AnnualPrecipitation(_, wsid, year)) pipeTo requester

  /** Returns the `k` highest precipitation values for a station in the `year`. */
  def topK(wsid: String, year: Int, k: Int, requester: ActorRef): Unit = {
    val toTopK = (aggregate: Seq[Double]) => TopKPrecipitation(wsid, year,
      ssc.sparkContext.parallelize(aggregate).top(k).toSeq)
    ssc.cassandraTable[Double](keyspace, dailytable)
      .select("precipitation")
      .where("wsid = ? AND year = ?", wsid, year)
      .collectAsync().map(toTopK) pipeTo requester
  }
}
73. A New Approach
• One Runtime: streaming, scheduled
• Simplified architecture
• Allows us to
• Write different types of applications
• Write more type safe code
• Write more reusable code
74. Need daily analytics aggregate reports? Do it in the stream, save
results in Cassandra for easy reporting as needed - with data
locality not offered by S3.
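A rough sketch of that pattern, assuming a hypothetical DStream of events carrying wsid, day, and oneHourPrecip fields, with made-up keyspace and table names:

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

// Aggregate per station per day inside the stream, then persist for reporting.
stream
  .map(e => ((e.wsid, e.day), e.oneHourPrecip))
  .reduceByKey(_ + _)
  .map { case ((wsid, day), total) => (wsid, day, total) }
  .saveToCassandra("sensor_ks", "daily_reports", SomeColumns("wsid", "day", "precipitation"))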
75. FiloDB
Distributed, columnar database designed to run very fast
analytical queries
• Ingest streaming data from many streaming sources
• Row-level, column-level operations and built-in versioning
offer greater flexibility than file-based technologies
• Currently based on Apache Cassandra & Spark
• github.com/tuplejump/FiloDB
76. FiloDB
• Breakthrough performance levels for analytical queries
• Performance comparable to Parquet
• One to two orders of magnitude faster than Spark on
Cassandra 2.x
• Versioned - critical for reprocessing logic/code changes
• Can simplify your infrastructure dramatically
• Queries run in parallel in Spark for scale-out ad-hoc analysis
• Space-saving techniques